Data Analysis & Visualization - Michael Cheng¶
Project Problem Statement - AllLife Bank Customer Segmentation¶
Background¶
Context
AllLife Bank wants to focus on its credit card customer base in the next financial year. They have been advised by their marketing research team that the penetration in the market can be improved. Based on this input, the marketing team proposes to run personalized campaigns to target new customers as well as upsell to existing customers.
Another insight from the market research was that the customers perceive the support services of the bank poorly. Based on this, the operations team wants to upgrade the service delivery model, to ensure that customers' queries are resolved faster. The head of marketing and the head of delivery, both decide to reach out to the Data Science team for help.
Objective
Identify different segments in the existing customer base, taking into account their spending patterns as well as past interactions with the bank.
Data Description: Data is available on customers of the bank with their credit limit, the total number of credit cards the customer has, and different channels through which the customer has contacted the bank for any queries. These different channels include visiting the bank, online, and through a call center.
Sl_no - Customer Serial Number
Customer Key - Customer identification
Avg_Credit_Limit - Average credit limit (currency is not specified, you can make an assumption around this)
Total_Credit_Cards - Total number of credit cards
Total_visits_bank - Total bank visits
Total_visits_online - Total online visits
Total_calls_made - Total calls made
Import Libraries & Load Data¶
import pandas as pd
# Importing PCA and t-SNE
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Summary Tools
from summarytools import dfSummary
data2 = pd.read_excel("/mnt/e/mikecbos_E/Downloads/MIT_Elective-AllLife/Credit+Card+Customer+Data.xlsx")
Data Preprocessing¶
# Copy of data
df2 = data2.copy()
# Overview of data
print(df2.head())
df2.info()
dfSummary(df2)
   Sl_No  Customer Key  Avg_Credit_Limit  Total_Credit_Cards  \
0      1         87073            100000                   2
1      2         38414             50000                   3
2      3         17341             50000                   7
3      4         40496             30000                   5
4      5         47437            100000                   6

   Total_visits_bank  Total_visits_online  Total_calls_made
0                  1                    1                 0
1                  0                   10                 9
2                  1                    3                 4
3                  1                    1                 4
4                  0                   12                 3

&lt;class 'pandas.core.frame.DataFrame'&gt;
RangeIndex: 660 entries, 0 to 659
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Sl_No                660 non-null    int64
 1   Customer Key         660 non-null    int64
 2   Avg_Credit_Limit     660 non-null    int64
 3   Total_Credit_Cards   660 non-null    int64
 4   Total_visits_bank    660 non-null    int64
 5   Total_visits_online  660 non-null    int64
 6   Total_calls_made     660 non-null    int64
dtypes: int64(7)
memory usage: 36.2 KB
| No | Variable | Stats / Values | Freqs / (% of Valid) | Missing |
|---|---|---|---|---|
| 1 | Sl_No [int64] | Mean (sd): 330.5 (190.7); min < med < max: 1.0 < 330.5 < 660.0; IQR (CV): 329.5 (1.7) | 660 distinct values | 0 (0.0%) |
| 2 | Customer Key [int64] | Mean (sd): 55141.4 (25627.8); min < med < max: 11265.0 < 53874.5 < 99843.0; IQR (CV): 43377.2 (2.2) | 655 distinct values | 0 (0.0%) |
| 3 | Avg_Credit_Limit [int64] | Mean (sd): 34574.2 (37625.5); min < med < max: 3000.0 < 18000.0 < 200000.0; IQR (CV): 38000.0 (0.9) | 110 distinct values | 0 (0.0%) |
| 4 | Total_Credit_Cards [int64] | Values 1 to 10 | 4: 151 (22.9%), 6: 117 (17.7%), 7: 101 (15.3%), 5: 74 (11.2%), 2: 64 (9.7%), 1: 59 (8.9%), 3: 53 (8.0%), 10: 19 (2.9%), 9: 11 (1.7%), 8: 11 (1.7%) | 0 (0.0%) |
| 5 | Total_visits_bank [int64] | Values 0 to 5 | 2: 158 (23.9%), 1: 112 (17.0%), 0: 100 (15.2%), 3: 100 (15.2%), 5: 98 (14.8%), 4: 92 (13.9%) | 0 (0.0%) |
| 6 | Total_visits_online [int64] | Mean (sd): 2.6 (2.9); min < med < max: 0.0 < 2.0 < 15.0; IQR (CV): 3.0 (0.9) | 16 distinct values | 0 (0.0%) |
| 7 | Total_calls_made [int64] | Mean (sd): 3.6 (2.9); min < med < max: 0.0 < 3.0 < 10.0; IQR (CV): 4.0 (1.3) | 11 distinct values | 0 (0.0%) |
Preliminary Observations
Dataset contains 660 rows and 7 columns with no missing values; all values are integers, representing customer data (credit and bank interactions)
The features align naturally to the following categories: CustomerID, CreditProfile, and BankInteraction
a. CustomerID: Sl_No and Customer Key
b. CreditProfile: Avg_Credit_Limit and Total_Credit_Cards
c. BankInteraction: Total_visits_bank, Total_visits_online, and Total_calls_made
Customer Serial Number (Sl_No) has 660 distinct records whereas Customer Identification (Customer Key) has 655 distinct records; need to review and verify for duplicates
Statistically:
a. Avg_Credit_Limit has the highest coefficient of variation (CV), indicating substantial heterogeneity
b. As a whole, the BankInteraction metrics show moderate variability, with CV roughly between 1.0 and 1.5
Total_visits_bank has a limited range (0 to 5), with most customers not exceeding 5 visits; this implies customer interaction is less reliant on the traditional brick-and-mortar approach to banking
Total_visits_online has a wide range (0 to 15) with high variability (standard deviation 2.9 against a mean of 2.6) compared to physical visits, confirming customers' reliance on virtual over physical banking interactions; this contrasts with the BankInteraction metrics as a whole and will benefit from deeper exploration
Total_calls_made has relatively consistent variance (standard deviation 2.9 against a mean of 3.6), with a long tail extending to the right; this tail forms a group of outliers, a subset of customers who make significantly more calls than the majority, and will benefit from deeper exploration
c. Total_Credit_Cards shows low variance (standard deviation of 2.2), suggesting a stable distribution across the population
d. Long tails are evident in Avg_Credit_Limit, Total_visits_online, and Total_calls_made, and will benefit from deeper exploration into their respective outliers
The CreditProfile category may represent a "low-hanging fruit" investigation opportunity for discovering potential hidden relationships, given the high variability of Avg_Credit_Limit juxtaposed with the low variability of Total_Credit_Cards
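The CV and long-tail claims above can be quantified with two small helpers. This is a minimal sketch on synthetic stand-in rows (not the project dataset), using Tukey's 1.5 × IQR rule for the outlier tally:

```python
import pandas as pd

# Illustrative sketch: coefficient of variation (sd / mean) and an
# IQR-based outlier count. The frame below is synthetic stand-in data.
demo = pd.DataFrame({
    "Avg_Credit_Limit": [3000, 8000, 18000, 50000, 100000, 200000],
    "Total_Credit_Cards": [1, 3, 4, 5, 6, 7],
})

def cv(s: pd.Series) -> float:
    """Coefficient of variation: standard deviation relative to the mean."""
    return s.std() / s.mean()

def iqr_outliers(s: pd.Series) -> int:
    """Count points beyond the 1.5 * IQR whiskers (Tukey's rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

for col in demo.columns:
    print(f"{col}: CV={cv(demo[col]):.2f}, IQR outliers={iqr_outliers(demo[col])}")
```

Applied to the real df2 columns, these helpers would reproduce the spread comparisons summarized above.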
Decision Point¶
- CustomerID fields are categorical and risk introducing noise into the analysis and clustering
- Create Customer_ID by concatenating Customer Key with Sl_No to distinguish between records (duplicate keys may arise from historical transactions, shared access within a household, accounts held for different purposes by the same customer, etc.)
- Sl_No and Customer Key can then be dropped
- The new Customer_ID can then be indexed as necessary in subsequent studies
CustomerID¶
# Inspect duplicates
duplicate_keys = df2[df2['Customer Key'].duplicated(keep=False)]
duplicate_keys.groupby('Customer Key').size().reset_index(name='Frequency')
duplicate_keys.sort_values(by='Customer Key')
|  | Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|---|
| 48 | 49 | 37252 | 6000 | 4 | 0 | 2 | 8 |
| 432 | 433 | 37252 | 59000 | 6 | 2 | 1 | 2 |
| 4 | 5 | 47437 | 100000 | 6 | 0 | 12 | 3 |
| 332 | 333 | 47437 | 17000 | 7 | 3 | 1 | 0 |
| 411 | 412 | 50706 | 44000 | 4 | 5 | 0 | 2 |
| 541 | 542 | 50706 | 60000 | 7 | 5 | 2 | 2 |
| 391 | 392 | 96929 | 13000 | 4 | 5 | 0 | 0 |
| 398 | 399 | 96929 | 67000 | 6 | 2 | 2 | 2 |
| 104 | 105 | 97935 | 17000 | 2 | 1 | 2 | 10 |
| 632 | 633 | 97935 | 187000 | 7 | 1 | 7 | 0 |
# Create the Customer_ID by concatenating Customer Key and Sl_No
df2['Customer_ID'] = df2['Customer Key'].astype(str) + "_" + df2['Sl_No'].astype(str)
# Review the updated DataFrame
print(df2[['Customer Key', 'Sl_No', 'Customer_ID']].head(20))
    Customer Key  Sl_No Customer_ID
0          87073      1     87073_1
1          38414      2     38414_2
2          17341      3     17341_3
3          40496      4     40496_4
4          47437      5     47437_5
5          58634      6     58634_6
6          48370      7     48370_7
7          37376      8     37376_8
8          82490      9     82490_9
9          44770     10    44770_10
10         52741     11    52741_11
11         52326     12    52326_12
12         92503     13    92503_13
13         25084     14    25084_14
14         68517     15    68517_15
15         55196     16    55196_16
16         62617     17    62617_17
17         96463     18    96463_18
18         39137     19    39137_19
19         14309     20    14309_20
# drop original CustomerID fields
df2 = df2.drop(['Customer Key', 'Sl_No'], axis = 1)
df2
|  | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made | Customer_ID |
|---|---|---|---|---|---|---|
| 0 | 100000 | 2 | 1 | 1 | 0 | 87073_1 |
| 1 | 50000 | 3 | 0 | 10 | 9 | 38414_2 |
| 2 | 50000 | 7 | 1 | 3 | 4 | 17341_3 |
| 3 | 30000 | 5 | 1 | 1 | 4 | 40496_4 |
| 4 | 100000 | 6 | 0 | 12 | 3 | 47437_5 |
| ... | ... | ... | ... | ... | ... | ... |
| 655 | 99000 | 10 | 1 | 10 | 0 | 51108_656 |
| 656 | 84000 | 10 | 1 | 13 | 2 | 60732_657 |
| 657 | 145000 | 8 | 1 | 9 | 1 | 53834_658 |
| 658 | 172000 | 10 | 1 | 15 | 0 | 80655_659 |
| 659 | 167000 | 9 | 0 | 12 | 2 | 80150_660 |
660 rows × 6 columns
df2.info()
&lt;class 'pandas.core.frame.DataFrame'&gt;
RangeIndex: 660 entries, 0 to 659
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Avg_Credit_Limit     660 non-null    int64
 1   Total_Credit_Cards   660 non-null    int64
 2   Total_visits_bank    660 non-null    int64
 3   Total_visits_online  660 non-null    int64
 4   Total_calls_made     660 non-null    int64
 5   Customer_ID          660 non-null    object
dtypes: int64(5), object(1)
memory usage: 31.1+ KB
CreditProfile¶
# Preliminary bivariate analysis
import matplotlib.pyplot as plt
import seaborn as sns
# Create a figure with two subplots: one for scatter plot and one for box plot
fig, ax = plt.subplots(2, 1, figsize=(10, 12), sharex=True, gridspec_kw={'height_ratios': [1, 3]})
# Scatter Plot
sns.scatterplot(
data=df2,
x='Total_Credit_Cards',
y='Avg_Credit_Limit',
ax=ax[0],
alpha=0.7
)
ax[0].set_title('Scatter Plot of Avg_Credit_Limit vs Total_Credit_Cards')
ax[0].set_ylabel('Avg_Credit_Limit')
ax[0].grid(visible=True)
# Box Plot
sns.boxplot(
data=df2,
x='Total_Credit_Cards',
y='Avg_Credit_Limit',
ax=ax[1]
)
ax[1].set_title('Box Plot of Avg_Credit_Limit Across Total_Credit_Cards')
ax[1].set_xlabel('Total_Credit_Cards')
ax[1].set_ylabel('Avg_Credit_Limit')
ax[1].grid(visible=True)
# Adjust layout
plt.tight_layout()
plt.show()
Observations¶
- Consistent with intuition: The more credit cards a customer has, the higher their credit limit
- A few outliers exist in the lower credit card groups, though they are less frequent there than in the higher groups. These outliers seem meaningful for further analysis, so K-Medoids will be effective at incorporating these data points proportionally
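The "more cards, higher limit" reading above can be backed with per-group medians, which are robust to the outliers just noted. A hedged sketch on synthetic stand-in rows (on the real data this would be a groupby over df2):

```python
import pandas as pd

# Synthetic stand-in rows illustrating the median-per-group check;
# not the project data.
demo = pd.DataFrame({
    "Total_Credit_Cards": [1, 1, 2, 2, 4, 4, 6, 6, 7, 7],
    "Avg_Credit_Limit":   [5000, 6000, 9000, 11000, 16000, 20000,
                           45000, 60000, 90000, 120000],
})
# Median credit limit within each card-count group
medians = demo.groupby("Total_Credit_Cards")["Avg_Credit_Limit"].median()
print(medians)
# Does the median limit rise with the number of cards in this sample?
print(medians.is_monotonic_increasing)
```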
# Kernel Density Estimation: Evaluate the high variability of Avg_Credit_Limit vs low variability of Total_Credit_Cards
# Coefficient of Variation
cv_credit_limit = (df2['Avg_Credit_Limit'].std() / df2['Avg_Credit_Limit'].mean()) * 100
cv_credit_cards = (df2['Total_Credit_Cards'].std() / df2['Total_Credit_Cards'].mean()) * 100
print(f"Raw Scores - Coefficient of Variation:")
print(f"CV of Avg_Credit_Limit: {cv_credit_limit:.2f}%")
print(f"CV of Total_Credit_Cards: {cv_credit_cards:.2f}%")
import numpy as np
# Log transformation for Avg_Credit_Limit to adjust scale
df2['Log_Avg_Credit_Limit'] = np.log1p(df2['Avg_Credit_Limit']) # Use log(1 + x) to handle zero values if present
# Overlayed Density Plot
plt.figure(figsize=(10, 6))
sns.kdeplot(df2['Log_Avg_Credit_Limit'], label='Log(Avg_Credit_Limit)', fill=True, color='blue', alpha=0.7)
sns.kdeplot(df2['Total_Credit_Cards'], label='Total_Credit_Cards', fill=True, color='orange', alpha=0.7)
# Plot Titles and Labels
plt.title('Overlayed Distributions of Credit Limit and Total Credit Cards', fontsize=14)
plt.xlabel('Distribution', fontsize=12)
plt.xticks([])
plt.yticks([])
plt.text(0.5, 0.95, "Note: Each region is independent and proportional to its own scale.",
fontsize=8,
color="gray",
ha="center",
va="center",
transform=plt.gca().transAxes
)
plt.text(0.01, -0.05, "Less <---",transform=plt.gca().transAxes, fontsize=12, ha='left', va='center')
plt.text(0.99, -0.05, '---> More', transform=plt.gca().transAxes, fontsize=12, ha='right', va='center')
plt.ylabel('Concentration of Customers', fontsize=12)
plt.legend(fontsize=10)
plt.grid(visible=True)
# Display the plot
plt.tight_layout()
plt.show()
Raw Scores - Coefficient of Variation:
CV of Avg_Credit_Limit: 108.83%
CV of Total_Credit_Cards: 46.06%
Observations_KDE¶
- Log transformation was needed due to the scaling differences between the two factors
- Customers in the orange region have few credit cards and a low credit limit (mostly falling outside the blue region)
  - These customers may have limited banking engagement and/or fewer financial resources
  - These customers may also hold banking relationships elsewhere
  - Marketing to these customers may be more elusive, since it may entail a long-term endeavor with few "quick wins", making such engagement incidental rather than intentional in nature
- Customers in the blue region have a high credit limit
  - Due to the overlap of orange within this blue region, there is ambiguity as to whether they hold many or few credit cards (the graph is not to scale)
  - The ambiguity from the two overlapping regions therefore needs further bivariate analysis
  - An AUC comparison, normalized for absolute scale, along with KMeans customer-level analysis and segment-specific insights, can provide a more methodical approach to predictive analysis and intentional marketing for these customers than for those in the orange region
# Further Bivariate Analysis: AUC Comparison
from sklearn.preprocessing import StandardScaler
from scipy.stats import gaussian_kde
from scipy.integrate import quad
# Step 1: Standardize both variables
scaler = StandardScaler()
df2[['Standardized_Credit_Limit', 'Standardized_Credit_Cards']] = scaler.fit_transform(
df2[['Avg_Credit_Limit', 'Total_Credit_Cards']]
)
# Step 2: Define KDEs for standardized data
kde_credit_limit = gaussian_kde(df2['Standardized_Credit_Limit'])
kde_credit_cards = gaussian_kde(df2['Standardized_Credit_Cards'])
# Common X range
x_min = min(df2[['Standardized_Credit_Limit', 'Standardized_Credit_Cards']].min())
x_max = max(df2[['Standardized_Credit_Limit', 'Standardized_Credit_Cards']].max())
x_range = np.linspace(x_min, x_max, 1000)
y_credit_limit = kde_credit_limit(x_range)
y_credit_cards = kde_credit_cards(x_range)
# Step 3: Calculate Overlap
def overlap_area(x):
return min(kde_credit_limit(x), kde_credit_cards(x))
overlap_auc, _ = quad(overlap_area, x_min, x_max)
# Total AUCs
total_auc_credit_limit = quad(lambda x: kde_credit_limit(x), x_min, x_max)[0]
total_auc_credit_cards = quad(lambda x: kde_credit_cards(x), x_min, x_max)[0]
# Normalize overlap
overlap_ratio_credit_limit = overlap_auc / total_auc_credit_limit
overlap_ratio_credit_cards = overlap_auc / total_auc_credit_cards
# Visualization
plt.figure(figsize=(10, 6))
plt.plot(x_range, y_credit_limit, label='Standardized Avg_Credit_Limit (KDE)', color='blue')
plt.plot(x_range, y_credit_cards, label='Standardized Total_Credit_Cards (KDE)', color='orange')
plt.fill_between(
x_range,
np.minimum(y_credit_limit, y_credit_cards),
color='purple',
alpha=0.5,
label='Overlap Region'
)
plt.title('Overlapping AUC Between Standardized Avg_Credit_Limit and Total_Credit_Cards')
plt.xlabel('Standardized Value Range')
plt.ylabel('Density')
plt.legend()
plt.grid()
plt.show()
# Print Results
print(f"Overlap AUC: {overlap_auc:.4f}")
print(f"Total AUC (Standardized Avg_Credit_Limit): {total_auc_credit_limit:.4f}")
print(f"Total AUC (Standardized Total_Credit_Cards): {total_auc_credit_cards:.4f}")
print(f"Overlap as % of Standardized Avg_Credit_Limit AUC: {overlap_ratio_credit_limit:.2%}")
print(f"Overlap as % of Standardized Total_Credit_Cards AUC: {overlap_ratio_credit_cards:.2%}")
Overlap AUC: 0.6792
Total AUC (Standardized Avg_Credit_Limit): 0.9977
Total AUC (Standardized Total_Credit_Cards): 0.9509
Overlap as % of Standardized Avg_Credit_Limit AUC: 68.08%
Overlap as % of Standardized Total_Credit_Cards AUC: 71.43%
Observations¶
The earlier sliver of overlap accounts for roughly 68% of the customer dataset, while the towering blue and orange regions combined account for the remaining ~32%. This high absolute overlap suggests that a significant proportion of the distributions align, and that the ranges of credit limits and credit card counts are shared by many customers. The total AUC scores above confirm that the Kernel Density Estimations are well-defined and appropriately scaled.
Of the purple overlapping region:
- 68.08% pertains to Avg_Credit_Limit
- 71.43% pertains to Total_Credit_Cards

These percentages suggest a meaningful overlap between the two variables. The remaining ~30% of non-overlapping area may represent unique customer groups that can be reviewed further.
# Credit Profile/Upselling: EDA
from sklearn.cluster import KMeans
import warnings
# Suppress warnings
warnings.filterwarnings("ignore", category=FutureWarning) # Suppress FutureWarnings
warnings.filterwarnings("ignore", category=UserWarning) # Suppress UserWarnings
# Extract normalized data
normalized_data = df2[['Standardized_Credit_Limit', 'Standardized_Credit_Cards']]
# Apply KMeans Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df2['Cluster'] = kmeans.fit_predict(normalized_data)
# Visualize Clusters
plt.figure(figsize=(10, 6))
plt.scatter(
df2['Standardized_Credit_Limit'],
df2['Standardized_Credit_Cards'],
c=df2['Cluster'],
cmap='viridis',
alpha=0.6
)
plt.title('Customer Segments Based on Credit Limit and Credit Cards')
plt.xlabel('Standardized Credit Limit')
plt.ylabel('Standardized Total Credit Cards')
plt.colorbar(label='Cluster')
plt.grid()
plt.show()
EDA_Observations_on_Credit_Profile¶
The clusters from this preliminary CreditProfile EDA plot can also help reveal distinct customer segments based on standardized credit limits and credit card counts. Along with the previous EDA on CustomerID, CreditProfile comprises non-volitional factors that indirectly drive costs, revenue, and/or service dissatisfaction. Engaging with these factors is more incidental (i.e., being ready for when the prospect/customer is willing and able to buy). Interpretations will follow after further PCA and Ensemble Clustering analyses.
DecisionPoint_Upsell¶
- While the bank is looking to upsell to its existing customers [2], the dataset provides a very limited view of how to directly support any upselling efforts. A potential proxy for upselling is Total_Credit_Cards, if we assume that holding more credit cards correlates with higher customer value (i.e., higher revenue, loyalty, or engagement)
- However, even with Total_Credit_Cards as a proxy, too many cards can lead to diminishing returns (i.e., credit risk for customers, a high level of service for the bank, etc.)
- Given the bank's focus on credit cards, however, "Upselling" will pertain to credit cards and loan products, as inferences can be drawn from the relationship between Avg_Credit_Limit and Total_Credit_Cards
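Before leaning on Total_Credit_Cards as a proxy, a quick rank correlation against Avg_Credit_Limit gives a sanity check on the assumed relationship. A hedged sketch on synthetic stand-in rows (on the real data this would be `df2["Total_Credit_Cards"].corr(df2["Avg_Credit_Limit"], method="spearman")`):

```python
import pandas as pd

# Synthetic stand-in rows, not the project data; chosen to be monotone
# so the illustrative correlation is exactly 1.0.
demo = pd.DataFrame({
    "Total_Credit_Cards": [2, 3, 7, 5, 6, 4, 8, 10],
    "Avg_Credit_Limit":   [10000, 15000, 50000, 30000, 45000, 20000, 90000, 150000],
})
# Spearman rank correlation: robust to the long tail in credit limits
rho = demo["Total_Credit_Cards"].corr(demo["Avg_Credit_Limit"], method="spearman")
print(f"Spearman rho = {rho:.2f}")
```

A strong positive rho on the real data would support the proxy; a weak one would argue for treating the two variables separately.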
PCA and Ensemble Clustering¶
Banking_Interaction_(bank_visits,_online_visits,_and_calls_made)¶
- Banking Interaction does not represent a goal in this study as there are many ways banking interactions can be a cost to the business, while at the same time, presenting revenue opportunities (thus involving confounding factors)
- These variables represent volitional factors that influence costs, revenue, and/or service dissatisfaction
- The limited dataset presents Total_Credit_Cards and Credit_Limit as proxies for upselling [3]
- Banking Interaction, therefore, is meaningful as it pertains to upselling opportunities, specifically for Credit Card and Loan Products
PCA and Clustering Model Analysis to evaluate:¶
UpsellingOpportunities¶
Upselling: PCA¶
# Preprocess data, use PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
features = ['Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online', 'Total_calls_made', 'Avg_Credit_Limit']
scaler = StandardScaler()
normalized_data = scaler.fit_transform(df2[features])
pca = PCA()
pca_data = pca.fit_transform(normalized_data)
# Plot explained variance ratio
import matplotlib.pyplot as plt
plt.plot(range(1, len(features) + 1), pca.explained_variance_ratio_.cumsum(), marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()
# Review Principal Components: Access PCA loadings
loadings = pd.DataFrame(
pca.components_,
columns=features, # Original feature names
index=[f'PC{i+1}' for i in range(len(features))] # Label components
)
print(loadings)
     Total_Credit_Cards  Total_visits_bank  Total_visits_online  Total_calls_made  Avg_Credit_Limit
PC1            0.597679           0.280492             0.111783         -0.559129          0.488859
PC2            0.030171          -0.586587             0.665161          0.223527          0.403240
PC3            0.284983           0.613522             0.304948          0.670351         -0.003461
PC4            0.741352          -0.445278            -0.318388          0.235605         -0.308617
PC5           -0.105122          -0.050586            -0.592200          0.364047          0.709337
# Visualize Principal Components impact
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
sns.heatmap(loadings, annot=True, cmap='coolwarm')
plt.title('Upselling PCA Loadings Heatmap')
plt.xticks(fontsize=8, rotation = 45)
plt.show()
Finding¶
Although the PCA cumulative explained variance plot suggests 2 components as an inflection point, the heat map of loadings reveals that dropping to 2 components would result in the loss of important feature contributions, particularly from PCs 3, 4, and 5, which capture nuanced and actionable patterns in the data.
Decision Point¶
Based on the loadings and their contributions across all 5 principal components (PCs), including all 5 components appears to be meaningful, especially for capturing nuanced behaviors and contrasts in the data. This approach will capture detailed behavioral patterns (e.g., identifying low-credit, digitally active customers in PC5, etc.), and/or diversity within the customer base. With the limited dataset, the inclusion of all 5 components will not overly complicate subsequent clustering analysis.
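One way to sanity-check the decision to retain all 5 components is to track reconstruction error as components are truncated. This is a hedged sketch on synthetic Gaussian data standing in for normalized_data; the error reaches ~0 only when every component is kept:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardized 5-feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

errors = {}
for k in range(1, 6):
    p = PCA(n_components=k).fit(X)
    # Project down to k components, then map back to the original space
    X_hat = p.inverse_transform(p.transform(X))
    errors[k] = float(np.mean((X - X_hat) ** 2))

for k, e in errors.items():
    print(f"{k} components: reconstruction MSE={e:.4f}")
```

On roughly isotropic data like this, each dropped component discards a similar share of variance, mirroring the loadings heatmap's point that PCs 3 to 5 still carry signal.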
Upselling: Clustering Analysis¶
# KMeans / GMM / KMedoids
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.utils.metric import distance_metric, type_metric
import matplotlib.pyplot as plt
import warnings
import pandas as pd
# Suppress warnings
warnings.filterwarnings("ignore", category=FutureWarning) # Suppress FutureWarnings
warnings.filterwarnings("ignore", category=UserWarning) # Suppress UserWarnings
# ===== Step 1: Determine Optimal Number of Clusters (Elbow Plot) =====
# Calculate WCSS for different numbers of clusters
wcss = []
for k in range(1, 10):
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(pca_data)
wcss.append(kmeans.inertia_)
# Plot Elbow Curve
plt.figure(figsize=(5, 3))
plt.plot(range(1, 10), wcss, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method for Optimal Clusters')
plt.show()
# Optimal number of clusters (can be adjusted based on the Elbow Plot)
optimal_k = 3
# ===== Step 2: Add PCA Features to DataFrame =====
# Define feature column names
features = [f"PC{i+1}" for i in range(pca_data.shape[1])] # Assuming PCA was used
# Create a DataFrame from PCA data if it isn't already part of df2
for i, feature in enumerate(features):
df2[feature] = pca_data[:, i]
# ===== Step 3: Apply Clustering Methods =====
# ---- KMeans Clustering ----
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
kmeans_clusters = kmeans.fit_predict(pca_data)
df2['KMeans_Cluster'] = kmeans_clusters
# ---- Gaussian Mixture Model (GMM) ----
gmm = GaussianMixture(n_components=optimal_k, random_state=42)
gmm_clusters = gmm.fit_predict(pca_data)
df2['GMM_Cluster'] = gmm_clusters
# ---- K-Medoids Clustering ----
# Define initial medoids (indices based on domain knowledge or random)
initial_medoids = [0, 50, 100] # Example indices for 3 clusters
kmedoids_instance = kmedoids(
pca_data, initial_medoids, metric=distance_metric(type_metric.EUCLIDEAN)
)
kmedoids_instance.process()
# Extract K-Medoids cluster assignments
kmedoids_clusters = kmedoids_instance.get_clusters()
df2['KMedoids_Cluster'] = -1
for cluster_id, indices in enumerate(kmedoids_clusters):
df2.loc[indices, 'KMedoids_Cluster'] = cluster_id
# ===== Step 4: Analyze Clusters =====
# Function to analyze cluster profiles
def analyze_clusters(df, cluster_column, feature_columns):
cluster_profiles = df.groupby(cluster_column)[feature_columns].mean()
cluster_profiles.index = cluster_profiles.index + 1 # Make clusters 1-based index
return cluster_profiles
# Generate cluster profiles for each method
print("KMeans Cluster Profiles (averages):")
print(analyze_clusters(df2, 'KMeans_Cluster', features))
print("\nGMM Cluster Profiles (averages):")
print(analyze_clusters(df2, 'GMM_Cluster', features))
print("\nKMedoids Cluster Profiles (averages):")
print(analyze_clusters(df2, 'KMedoids_Cluster', features))
KMeans Cluster Profiles (averages):
PC1 PC2 PC3 PC4 PC5
KMeans_Cluster
1 0.647279 -0.880009 -0.024133 0.032796 0.038631
2 2.992098 3.531877 0.118569 -0.107173 -0.123788
3 -1.783279 0.728079 0.015120 -0.032593 -0.038939
GMM Cluster Profiles (averages):
PC1 PC2 PC3 PC4 PC5
GMM_Cluster
1 0.647279 -0.880009 -0.024133 0.032796 0.038631
2 2.992098 3.531877 0.118569 -0.107173 -0.123788
3 -1.783279 0.728079 0.015120 -0.032593 -0.038939
KMedoids Cluster Profiles (averages):
PC1 PC2 PC3 PC4 PC5
KMedoids_Cluster
1 0.640276 -0.875990 -0.029436 0.033631 0.037106
2 2.992098 3.531877 0.118569 -0.107173 -0.123788
3 -1.792937 0.735541 0.024742 -0.034641 -0.036973
Observations¶
- The Elbow Plot indicates that 3 clusters is optimal
- KMeans and KMedoids results are very consistent, both being prototype-based (centroid vs. medoid), particularly for Clusters 1 and 3
- GMM, being distribution-based, can in principle capture different clustering behaviors, though its cluster profiles here match KMeans almost exactly
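As a cross-check on the elbow choice, the silhouette score can be computed across candidate k values; the k that maximizes it should agree with the elbow. A hedged sketch on synthetic three-blob data standing in for pca_data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs stand in for the PCA-projected data
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=c, scale=0.4, size=(50, 2))
    for c in [(0, 0), (4, 0), (2, 4)]
])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

Running the same loop over pca_data would confirm (or challenge) the elbow's choice of 3 with an independent criterion.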
Upselling: Ensemble Analysis¶
# 3-Model Ensemble
from sklearn.metrics import silhouette_score
import numpy as np
from scipy.stats import mode
# Assume the previous models have populated the following columns in df2:
# 'KMeans_Cluster', 'GMM_Cluster', 'KMedoids_Cluster'
# Extract cluster labels from the three models
kmeans_labels = df2['KMeans_Cluster'].to_numpy()
gmm_labels = df2['GMM_Cluster'].to_numpy()
kmedoids_labels = df2['KMedoids_Cluster'].to_numpy()
# Combine the cluster labels into a single array
all_labels = np.array([kmeans_labels, gmm_labels, kmedoids_labels])
# Perform majority voting to generate ensemble cluster assignments
ensemble_labels = mode(all_labels, axis=0)[0].flatten()
df2['Ensemble_Cluster'] = ensemble_labels
# Calculate silhouette scores for all models, including the ensemble
kmeans_silhouette = silhouette_score(pca_data, kmeans_labels)
gmm_silhouette = silhouette_score(pca_data, gmm_labels)
kmedoids_silhouette = silhouette_score(pca_data, kmedoids_labels)
ensemble_silhouette = silhouette_score(pca_data, ensemble_labels)
# Print silhouette scores for comparison
print("Silhouette Scores for Clustering Models:")
print(f"KMeans Silhouette Score: {kmeans_silhouette:.4f}")
print(f"GMM Silhouette Score: {gmm_silhouette:.4f}")
print(f"KMedoids Silhouette Score: {kmedoids_silhouette:.4f}")
print(f"3-Model Ensemble Silhouette Score: {ensemble_silhouette:.4f}")
Silhouette Scores for Clustering Models:
KMeans Silhouette Score: 0.5157
GMM Silhouette Score: 0.5157
KMedoids Silhouette Score: 0.5158
3-Model Ensemble Silhouette Score: 0.5157
Findings¶
The silhouette scores above are nearly identical across models (≈0.516), so GMM's distributional flexibility adds little here. KMeans and KMedoids edge ahead marginally and are the best-performing models. An ensemble combining these 2 models will be evaluated next.
# KMeans + KMedoids 2-model Ensemble
# Combine KMeans and KMedoids labels
refined_labels = np.array([kmeans_labels, kmedoids_labels])
ensemble_labels = mode(refined_labels, axis=0)[0].flatten()
df2['Ensemble_Cluster'] = ensemble_labels
# Calculate silhouette score for the refined ensemble
ensemble_silhouette = silhouette_score(pca_data, ensemble_labels)
# Print silhouette scores for comparison
print("Silhouette Scores for Revised Ensemble Clustering Models:")
print(f"KMeans Silhouette Score: {kmeans_silhouette:.4f}")
print(f"KMedoids Silhouette Score: {kmedoids_silhouette:.4f}")
print(f"Refined (2-Model) Ensemble Silhouette Score: {ensemble_silhouette:.4f}")
Silhouette Scores for Revised Ensemble Clustering Models:
KMeans Silhouette Score: 0.5157
KMedoids Silhouette Score: 0.5158
Refined (2-Model) Ensemble Silhouette Score: 0.5158
Findings¶
The revised ensemble of KMeans and KMedoids offers no improvement over the individual models (it matches KMedoids at 0.5158). Thus, KMeans and KMedoids will be used individually, drawing on their complementary strengths.
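The near-identical silhouette scores suggest the two partitions largely agree; the adjusted Rand index (ARI) quantifies that agreement directly, ignoring label permutations. A hedged sketch with toy labels standing in for df2['KMeans_Cluster'] and df2['KMedoids_Cluster']:

```python
from sklearn.metrics import adjusted_rand_score

# Toy labels, not the project clusters: the same grouping with
# permuted label names, so ARI should be exactly 1.0.
kmeans_demo   = [0, 0, 1, 1, 2, 2, 2, 0]
kmedoids_demo = [1, 1, 2, 2, 0, 0, 0, 1]

ari = adjusted_rand_score(kmeans_demo, kmedoids_demo)
print(f"Adjusted Rand index: {ari:.4f}")  # 1.0 = identical partitions
```

On the real columns, an ARI near 1.0 would confirm that KMeans and KMedoids differ only at the margins, as the cluster profiles above suggest.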
# Visualize KMeans and KMedoids
import plotly.express as px
# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42) # Adjust perplexity as needed
tsne_results = tsne.fit_transform(pca_data) # Use PCA-reduced data or normalized original data
# Ensure original index is retained
tsne_df2 = pd.DataFrame(tsne_results, columns=['TSNE1', 'TSNE2'], index=df2.index)
# Add cluster labels and original fields
tsne_df2['KMeans_Cluster'] = df2['KMeans_Cluster']
#tsne_df2['GMM_Cluster'] = df2['GMM_Cluster']
tsne_df2['KMedoids_Cluster'] = df2['KMedoids_Cluster']
# Specific fields from df2 for hover information
fields_to_include = ['Avg_Credit_Limit', 'Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online']
tsne_df2 = tsne_df2.join(df2[fields_to_include])
# Visualize with Plotly for K-Means
fig_kmeans = px.scatter(
tsne_df2, x='TSNE1', y='TSNE2', color='KMeans_Cluster',
hover_data=fields_to_include, # Add fields for hover information
title='t-SNE Visualization with K-Means Clusters',
color_continuous_scale='Viridis',
)
fig_kmeans.update_xaxes(showticklabels=False) # Hide x-axis tick labels
fig_kmeans.update_yaxes(showticklabels=False) # Hide y-axis tick labels
fig_kmeans.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)
fig_kmeans.show()
# Visualize with Plotly for GMM
#fig_gmm = px.scatter(
#tsne_df2, x='TSNE1', y='TSNE2', color='GMM_Cluster',
#hover_data=fields_to_include,
#title='t-SNE Visualization with GMM Clusters',
#color_continuous_scale='Viridis'
#)
#fig_gmm.update_xaxes(showticklabels=False) # Hide x-axis tick labels
#fig_gmm.update_yaxes(showticklabels=False) # Hide y-axis tick labels
#fig_gmm.show()
# Visualize with Plotly for K-Medoids
fig_kmedoids = px.scatter(
tsne_df2, x='TSNE1', y='TSNE2', color='KMedoids_Cluster',
hover_data=fields_to_include,
title='t-SNE Visualization with K-Medoids Clusters',
color_continuous_scale='Viridis'
)
fig_kmedoids.update_xaxes(showticklabels=False) # Hide x-axis tick labels
fig_kmedoids.update_yaxes(showticklabels=False) # Hide y-axis tick labels
fig_kmedoids.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)
fig_kmedoids.show()
Observation¶
Both KMeans and KMedoids provided well-separated clustering, as shown in the t-SNE visualizations, indicating that they successfully identified distinct groups in the data. The nuanced differences in cluster boundaries and shapes between the two models may reflect areas of ambiguity in the data. Leveraging both models offers a complementary perspective, capturing distinct aspects of the clustering structure and providing a more holistic view of the data.
Upselling Opportunities: Interpretations¶
K-Means clusters have some overlap but are mostly separated, indicating largely clear segmentation of customer groups. Customers are segmented into distinct groups based on their behavior (e.g., Total_visits_online, Total_Credit_Cards, etc.). This segmentation suits operational simplicity, where clear and distinct groups are needed for actionable insights.
K-Medoids clusters are less sensitive to outliers, which is evident in the clean delineation between data points in the t-SNE above. This approach balances the robustness of GMM with the clarity of K-Means, especially in handling outliers. Of the three models, this one seems preferable, as it minimizes the influence of noise and extreme values.
# Credit Profile EDA interpretations
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import warnings
# Suppress warnings
warnings.filterwarnings("ignore", category=FutureWarning)  # Suppress FutureWarnings
warnings.filterwarnings("ignore", category=UserWarning)  # Suppress UserWarnings
print("A Revisit to Credit Profile KMeans Clustering")
print("\n")
# Extract normalized data
normalized_data = df2[['Standardized_Credit_Limit', 'Standardized_Credit_Cards']]
# Apply KMeans Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df2['Cluster'] = kmeans.fit_predict(normalized_data)
# Visualize Clusters
plt.figure(figsize=(7, 3))
plt.scatter(
df2['Standardized_Credit_Limit'],
df2['Standardized_Credit_Cards'],
c=df2['Cluster'],
cmap='viridis',
alpha=0.6
)
plt.title('Customer Segments Based on Credit Profile', fontsize=10)
plt.xlabel('Standardized Credit Limit', fontsize=7)
plt.ylabel('Standardized Total Credit Cards', fontsize=7)
plt.tick_params(axis='x', which='both', bottom=False, labelbottom=False)
plt.tick_params(axis='y', which='both', left=False, labelleft=False)
plt.grid()
plt.show()
A Revisit to Credit Profile KMeans Clustering
Upsell Opportunities: Credit Profile Implications¶
Within this limited dataset, "upselling opportunities" are most evident in credit cards and loan products. Insights can be derived from the plot shown above:
- Teal Cluster at the Top Right (High Credit Limit & High Credit Cards)
- Customers in this cluster have both high standardized credit limits and high standardized credit card counts. This group likely represents premium customers with significant purchasing power and financial engagement. This is a very promising segment. These customers may be ideal candidates for premium products such as high-reward credit cards, investment services, and other exclusive benefits (e.g., concierge services, travel perks). They are likely already engaged with multiple financial products, so marketing should focus on cross-selling or retention strategies.
- Purple Cluster at the Bottom Left (Low Credit Limit & Low Credit Cards)
- Customers in this group have both low credit limits and low card counts. They are likely low-value customers with limited financial engagement. This is probably the least promising segment. These customers might not have the capacity to adopt additional financial products. Marketing efforts could focus on financial education or low-risk credit-building products. Capturing this segment seems to lend itself more to an incidental approach than to the intentional one suited to the top-right quadrant. Their potential for significant growth is limited, so they may not be worth heavy marketing investment.
- Yellow Cluster at the Middle Left (Moderate Credit Cards & Low to Moderate Credit Limit)
- These customers have moderate card counts but relatively low credit limits. They may already be utilizing their credit limits heavily (possibly maxed out). This is a moderately promising segment. These customers might be good candidates for credit limit increases (if creditworthiness supports it) and for budgeting tools or financial management products. However, they may represent credit risk if their current limits are already over-utilized.
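The qualitative reading of each cluster above can be backed with numbers by averaging the features per cluster. A minimal sketch on a toy frame; in the notebook, `df2` with the `Cluster` column from the KMeans cell above would be used instead.

```python
# Per-cluster feature means; toy stand-in for df2.groupby('Cluster').mean().
import pandas as pd

toy = pd.DataFrame({
    "Standardized_Credit_Limit": [1.2, 1.5, -0.9, -1.1, -0.4, -0.3],
    "Standardized_Credit_Cards": [1.0, 1.3, -1.2, -1.0, 0.5, 0.6],
    "Cluster": [0, 0, 1, 1, 2, 2],
})

profile = toy.groupby("Cluster").mean().round(2)
print(profile)  # one row of averages per cluster
```

The resulting table makes the "high limit / high cards" vs. "low limit / low cards" reading explicit, rather than relying on the scatter plot's colors alone.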
Ideal Customer Profile (ICP)¶
ICP: PCA¶
# Preprocess and PCA
# Standardize features
from sklearn.preprocessing import StandardScaler
features = ['Avg_Credit_Limit', 'Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online', 'Total_calls_made']
scaler = StandardScaler()
df2_icp_scaled = scaler.fit_transform(df2[features])
# Apply PCA
from sklearn.decomposition import PCA
import pandas as pd
pca_ICP = PCA()
pca_transformed_ICP = pca_ICP.fit_transform(df2_icp_scaled)
# Add PCA components back to a new DataFrame
df2_icp_pca = pd.DataFrame(
pca_transformed_ICP,
columns=[f'PCA_ICP_{i+1}' for i in range(pca_ICP.n_components_)]
)
# Clustering variables
from sklearn.cluster import KMeans
import warnings
# Suppress warnings
warnings.filterwarnings("ignore", category=FutureWarning) # Suppress FutureWarnings
warnings.filterwarnings("ignore", category=UserWarning) # Suppress UserWarnings
# Clustering on the first few PCA components
kmeans_ICP = KMeans(n_clusters=3, random_state=42)
df2['ICP_Cluster'] = kmeans_ICP.fit_predict(df2_icp_pca.iloc[:, :3])
# PCA Loadings Table and Heatmap
# Determine relationships between original features and PCA components
loadings_ICP = pd.DataFrame(
pca_ICP.components_,
columns=['Avg_Credit_Limit', 'Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online', 'Total_calls_made'],
index=[f'PCA_{i+1}' for i in range(pca_ICP.n_components_)]
)
print("\nContribution Scores from Principal Components \n")
print(loadings_ICP)
# Visualize relationship
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
sns.heatmap(loadings_ICP, annot=True, cmap='coolwarm')
plt.title('PCA Loadings Heatmap (ICP Study)')
plt.xticks(fontsize=8, rotation = 45)
plt.yticks(fontsize=8)
plt.show()
Contribution Scores from Principal Components
        Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  Total_visits_online  Total_calls_made
PCA_1           0.488859            0.597679           0.280492             0.111783         -0.559129
PCA_2           0.403240            0.030171          -0.586587             0.665161          0.223527
PCA_3          -0.003461            0.284983           0.613522             0.304948          0.670351
PCA_4          -0.308617            0.741352          -0.445278            -0.318388          0.235605
PCA_5           0.709337           -0.105122          -0.050586            -0.592200          0.364047
Baseline¶
The inclusion of all principal components will serve as a baseline against which component reduction can be compared.
# Plot cumulative explained variance ratio
# Use the number of features for x-axis limit and cumulative sum of explained variance ratio
plt.figure(figsize=(8, 5))
plt.plot(
range(1, len(pca_ICP.explained_variance_ratio_) + 1),
pca_ICP.explained_variance_ratio_.cumsum(),
marker='o', #linestyle='--'
)
plt.xlabel('Number of Principal Components', fontsize=12)
plt.ylabel('Cumulative Explained Variance', fontsize=12)
plt.title('Explained Variance Ratio by Principal Components (ICP)', fontsize=14)
plt.grid(True)
plt.show()
Observation¶
A clear inflection point appears at 2 principal components. Further analysis will compare this reduction against the baseline.
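The inflection point can also be read numerically rather than from the plot. A sketch on synthetic data; in the notebook, the fitted `pca_ICP` object would be used instead, and the 0.80 threshold below is an illustrative choice, not one taken from the analysis.

```python
# Smallest number of components whose cumulative explained variance
# clears a chosen threshold (0.80 here, an illustrative value).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=200)  # correlated pair

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumvar, 0.80) + 1)  # first count reaching 80%
print("cumulative variance:", cumvar.round(3))
print("components for 80%:", n_keep)
```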
# Evaluate ICP context for components, all vs reduced
# Filter numeric columns (excluding non-numeric columns like 'Customer_ID')
ICP_numeric_cols = [
'Avg_Credit_Limit',
'Total_Credit_Cards',
'Total_visits_bank',
'Total_visits_online',
'Total_calls_made'
]
# Ensure 'ICP_Cluster_All' is included in numeric columns for grouping
if 'ICP_Cluster_All' not in ICP_numeric_cols:
    ICP_numeric_cols.append('ICP_Cluster_All')
# Perform clustering with all 5 components
from sklearn.cluster import KMeans
kmeans_all = KMeans(n_clusters=3, random_state=42)
df2['ICP_Cluster_All'] = kmeans_all.fit_predict(pca_transformed_ICP[:, :5])
# Cluster summary for all components
summary_all = df2[ICP_numeric_cols].groupby('ICP_Cluster_All').mean()
# Make the cluster labels 1-indexed
summary_all.index = summary_all.index + 1
summary_all.index.name = 'Cluster (Averages)'
print("Cluster Summary (All Components):")
print(summary_all)
# Summary of Reduced Components
from sklearn.cluster import KMeans
import pandas as pd
# Perform clustering with the first 2 components
kmeans_reduced = KMeans(n_clusters=3, random_state=42)
df2['ICP_Cluster_Reduced'] = kmeans_reduced.fit_predict(pca_transformed_ICP[:, :2])
# Cluster summary based on reduced components
# Use PCA-transformed data ONLY for clustering and grouping
summary_reduced = (
df2[['ICP_Cluster_Reduced']]
.join(df2[[
'Avg_Credit_Limit', 'Total_Credit_Cards',
'Total_visits_bank', 'Total_visits_online', 'Total_calls_made'
]])
.groupby('ICP_Cluster_Reduced')
.mean()
)
# Make cluster labels 1-indexed for readability
summary_reduced.index = summary_reduced.index + 1
summary_reduced.index.name = 'Cluster (Averages)'
print("\nICP Cluster Summary (Reduced Components):")
print(summary_reduced)
Cluster Summary (All Components):
                    Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  Total_visits_online  Total_calls_made
Cluster (Averages)
1                       33782.383420            5.515544           3.489637             0.981865          2.000000
2                      141040.000000            8.740000           0.600000            10.900000          1.080000
3                       12174.107143            2.410714           0.933036             3.553571          6.870536
ICP Cluster Summary (Reduced Components):
                    Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  Total_visits_online  Total_calls_made
Cluster (Averages)
1                       33782.383420            5.515544           3.489637             0.981865          2.000000
2                       12174.107143            2.410714           0.933036             3.553571          6.870536
3                      141040.000000            8.740000           0.600000            10.900000          1.080000
Observation¶
The results continue to indicate that the reduced components are representative of the full set.
# Check how many customers are assigned to the same cluster across methods
comparison = pd.crosstab(df2['ICP_Cluster_All'], df2['ICP_Cluster_Reduced'])
print("\nCluster Assignment Comparison:")
comparison.columns = comparison.columns + 1
comparison.index = comparison.index + 1
print(comparison)
Cluster Assignment Comparison:
ICP_Cluster_Reduced    1    2   3
ICP_Cluster_All
1                    386    0   0
2                      0    0  50
3                      0  224   0
Observation¶
The comparison of cluster assignments between the All Components and Reduced Components runs strongly supports that the reduced components are adequately representative, with no mixing or overlap. This shows that the reduced components capture the same patterns as the full set, consistently and efficiently.
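The diagonal-only pattern in the crosstab can be condensed into a single agreement rate by optimally matching the two runs' label numberings and counting matched assignments. A sketch on toy labels; in the notebook, the `ICP_Cluster_All` and `ICP_Cluster_Reduced` columns would be compared.

```python
# Optimal label matching (Hungarian algorithm) + agreement rate.
import numpy as np
from scipy.optimize import linear_sum_assignment

a = np.array([0, 0, 1, 1, 2, 2, 2])
b = np.array([2, 2, 0, 0, 1, 1, 1])  # same partition, permuted names

# Contingency table: rows = labels in a, columns = labels in b.
contingency = np.zeros((3, 3), dtype=int)
for i, j in zip(a, b):
    contingency[i, j] += 1

# Negate so the minimum-cost assignment maximizes matched counts.
rows, cols = linear_sum_assignment(-contingency)
agreement = contingency[rows, cols].sum() / len(a)
print(f"Agreement after label matching: {agreement:.0%}")  # prints 100%
```

For the crosstab above, every row has a single nonzero cell, so the matched agreement rate would be 100% as well.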
import matplotlib.pyplot as plt
# Create a DataFrame for PCA-transformed data
pca_df = pd.DataFrame(
pca_transformed_ICP[:, :2], # Use the first two components
columns=['PCA_1', 'PCA_2']
)
pca_df['Cluster'] = df2['ICP_Cluster_Reduced'] # Add cluster labels
# Visualize clusters
plt.figure(figsize=(8, 6))
plt.scatter(
pca_df['PCA_1'],
pca_df['PCA_2'],
c=pca_df['Cluster'],
cmap='viridis',
s=50,
alpha=0.7
)
plt.title('ICP Customer Clusters (PCA Reduced)', fontsize=14)
plt.xlabel('PCA_1', fontsize=12)
plt.ylabel('PCA_2', fontsize=12)
#plt.colorbar(label='Cluster')
plt.grid(True)
plt.show()
Observations_PCA_reduced¶
Whereas the earlier Upselling study required all 5 components to uncover upselling opportunities, exhaustively reviewing both what to avoid (to prevent driving customers away) and what to pursue (to drive revenue), the ICP study narrows the focus to defining the ideal customer profile, allowing a more targeted approach. The upselling study was divergent in nature, exploring a wide range of possibilities; the ICP study is convergent, honing in on the specific traits that define the ideal customer.
Findings¶
Two principal components are sufficient to create the Ideal Customer Profile. The use of 3 clusters can be clearly confirmed with the PCA plot.
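The choice of 3 clusters can also be cross-checked with a silhouette sweep over candidate values of k. A sketch on synthetic blob data; in the notebook, `pca_transformed_ICP[:, :2]` would be passed instead.

```python
# Silhouette sweep over k for KMeans; synthetic stand-in data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=42)
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")
```

With well-separated blobs the score typically peaks at the true number of clusters; on the real data, the same sweep would support or challenge k = 3.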
ICP: Ensemble Clustering Analysis¶
# All components, with Ensemble
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.utils.metric import distance_metric, type_metric
from sklearn.metrics import silhouette_score
import numpy as np
from scipy.stats import mode
# Apply KMeans Clustering for ICP
kmeans_ICP = KMeans(n_clusters=3, random_state=42)
df2['KMeans_Cluster_ICP'] = kmeans_ICP.fit_predict(pca_transformed_ICP)
# Calculate silhouette score for KMeans
silhouette_kmeans_icp_all = silhouette_score(pca_transformed_ICP, df2['KMeans_Cluster_ICP'])
# GMM Clustering
gmm_ICP = GaussianMixture(n_components=3, random_state=42)
df2['GMM_Cluster_ICP'] = gmm_ICP.fit_predict(pca_transformed_ICP)
# Calculate silhouette score for GMM
silhouette_gmm_icp_all = silhouette_score(pca_transformed_ICP, df2['GMM_Cluster_ICP'])
# K-Medoids Clustering
initial_medoids = [0, 50, 100] # Example indices; modify based on your data
kmedoids_instance_ICP = kmedoids(
pca_transformed_ICP, initial_medoids, metric=distance_metric(type_metric.EUCLIDEAN)
)
kmedoids_instance_ICP.process()
kmedoids_clusters_ICP = kmedoids_instance_ICP.get_clusters()
# Assign K-Medoids clusters to df2
df2['KMedoids_Cluster_ICP'] = -1
for cluster_id, indices in enumerate(kmedoids_clusters_ICP):
    df2.loc[indices, 'KMedoids_Cluster_ICP'] = cluster_id
# Calculate silhouette score for K-Medoids
silhouette_kmedoids_icp_all = silhouette_score(pca_transformed_ICP, df2['KMedoids_Cluster_ICP'])
# Ensemble Clustering (Majority Voting)
kmeans_labels = df2['KMeans_Cluster_ICP'].to_numpy()
gmm_labels = df2['GMM_Cluster_ICP'].to_numpy()
kmedoids_labels = df2['KMedoids_Cluster_ICP'].to_numpy()
# Combine labels from all models
all_labels = np.array([kmeans_labels, gmm_labels, kmedoids_labels])
ensemble_labels = mode(all_labels, axis=0)[0].flatten()
df2['Ensemble_Cluster_ICP'] = ensemble_labels
# Calculate silhouette score for Ensemble
silhouette_ensemble_icp_all = silhouette_score(pca_transformed_ICP, ensemble_labels)
# Print silhouette scores
print(f"KMeans Silhouette Score (All Components): {silhouette_kmeans_icp_all:.4f}")
print(f"GMM Silhouette Score (All Components): {silhouette_gmm_icp_all:.4f}")
print(f"K-Medoids Silhouette Score (All Components): {silhouette_kmedoids_icp_all:.4f}")
print(f"Ensemble Silhouette Score (All Components): {silhouette_ensemble_icp_all:.4f}")
KMeans Silhouette Score (All Components): 0.5157
GMM Silhouette Score (All Components): 0.5157
K-Medoids Silhouette Score (All Components): 0.5158
Ensemble Silhouette Score (All Components): 0.5157
# KMeans / GMM / KMedoids with first 2 components
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score
import numpy as np
from scipy.stats import mode
# KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df2['KMeans_Cluster_ICP'] = kmeans.fit_predict(pca_transformed_ICP[:, :2])
# Silhouette score for KMeans
silhouette_kmeans_icp = silhouette_score(pca_transformed_ICP[:, :2], df2['KMeans_Cluster_ICP'])
# GMM clustering
gmm = GaussianMixture(n_components=3, random_state=42)
df2['GMM_Cluster_ICP'] = gmm.fit_predict(pca_transformed_ICP[:, :2])
# Silhouette score for GMM
silhouette_gmm_icp = silhouette_score(pca_transformed_ICP[:, :2], df2['GMM_Cluster_ICP'])
# K-Medoids clustering
kmedoids = KMedoids(n_clusters=3, random_state=42, metric='euclidean')
df2['KMedoids_Cluster_ICP'] = kmedoids.fit_predict(pca_transformed_ICP[:, :2])
# Silhouette score for K-Medoids
silhouette_kmedoids_icp = silhouette_score(pca_transformed_ICP[:, :2], df2['KMedoids_Cluster_ICP'])
# Ensemble Clustering (Majority Voting)
kmeans_labels = df2['KMeans_Cluster_ICP'].to_numpy()
gmm_labels = df2['GMM_Cluster_ICP'].to_numpy()
kmedoids_labels = df2['KMedoids_Cluster_ICP'].to_numpy()
# Combine labels from all models
all_labels = np.array([kmeans_labels, gmm_labels, kmedoids_labels])
ensemble_labels = mode(all_labels, axis=0)[0].flatten()
df2['Ensemble_Cluster_ICP'] = ensemble_labels
# Silhouette score for Ensemble
silhouette_ensemble_icp = silhouette_score(pca_transformed_ICP[:, :2], ensemble_labels)
# Print silhouette scores
print(f"KMeans Silhouette Score (ICP): {silhouette_kmeans_icp:.4f}")
print(f"GMM Silhouette Score (ICP): {silhouette_gmm_icp:.4f}")
print(f"K-Medoids Silhouette Score (ICP): {silhouette_kmedoids_icp:.4f}")
print(f"Ensemble Silhouette Score (ICP): {silhouette_ensemble_icp:.4f}")
KMeans Silhouette Score (ICP): 0.6829
GMM Silhouette Score (ICP): 0.6829
K-Medoids Silhouette Score (ICP): 0.5125
Ensemble Silhouette Score (ICP): 0.6829
# First 2 components with KMeans / GMM 2-Model Ensemble
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import numpy as np
from scipy.stats import mode
# KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df2['KMeans_Cluster_ICP'] = kmeans.fit_predict(pca_transformed_ICP[:, :2])
# Silhouette score for KMeans
silhouette_kmeans_icp = silhouette_score(pca_transformed_ICP[:, :2], df2['KMeans_Cluster_ICP'])
# GMM clustering
gmm = GaussianMixture(n_components=3, random_state=42)
df2['GMM_Cluster_ICP'] = gmm.fit_predict(pca_transformed_ICP[:, :2])
# Silhouette score for GMM
silhouette_gmm_icp = silhouette_score(pca_transformed_ICP[:, :2], df2['GMM_Cluster_ICP'])
# Ensemble Clustering (Majority Voting with KMeans and GMM)
kmeans_labels = df2['KMeans_Cluster_ICP'].to_numpy()
gmm_labels = df2['GMM_Cluster_ICP'].to_numpy()
# Combine labels from KMeans and GMM
refined_labels = np.array([kmeans_labels, gmm_labels])
ensemble_labels = mode(refined_labels, axis=0)[0].flatten()
df2['Ensemble_Cluster_ICP'] = ensemble_labels
# Silhouette score for Ensemble
silhouette_ensemble_icp = silhouette_score(pca_transformed_ICP[:, :2], ensemble_labels)
# Print silhouette scores
print(f"KMeans Silhouette Score (ICP): {silhouette_kmeans_icp:.4f}")
print(f"GMM Silhouette Score (ICP): {silhouette_gmm_icp:.4f}")
print(f"Ensemble Silhouette Score (ICP, KMeans + GMM): {silhouette_ensemble_icp:.4f}")
KMeans Silhouette Score (ICP): 0.6829
GMM Silhouette Score (ICP): 0.6829
Ensemble Silhouette Score (ICP, KMeans + GMM): 0.6829
Findings¶
The silhouette scores show that neither the 3-model nor the 2-model ensemble performed better than the individual models. Likewise, K-Medoids is not suited to this study. Only KMeans and GMM will therefore be used for creating the Ideal Customer Profile. Next, the full set of components should be compared against the first 2 components.
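One caveat on the majority-vote ensembles above: voting over raw labels is only meaningful when the models happen to agree on label numbering, which clustering algorithms do not guarantee. A numbering-free alternative is a co-association ensemble: average each model's "same cluster?" indicator matrix, then cluster the consensus matrix. A minimal sketch on toy labels.

```python
# Co-association ensemble: average "same cluster?" indicators across runs,
# then cluster the consensus matrix; no label alignment needed.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

runs = [
    np.array([0, 0, 1, 1, 2, 2]),
    np.array([2, 2, 0, 0, 1, 1]),  # same partition, different numbering
]
# Fraction of runs in which each pair of points shares a cluster.
coassoc = np.mean([(r[:, None] == r[None, :]).astype(float) for r in runs], axis=0)
# Turn consensus into a distance and extract 3 consensus clusters.
dist = squareform(1.0 - coassoc, checks=False)
consensus = fcluster(linkage(dist, method="average"), t=3, criterion="maxclust")
print(consensus)  # three consensus groups; pairwise structure preserved
```

Here the two runs describe the same partition under different numberings, and the consensus recovers it, whereas raw majority voting over these labels would not.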
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import numpy as np
from scipy.stats import mode
# ===== Full Components Clustering =====
# Apply KMeans Clustering for ICP
kmeans_ICP = KMeans(n_clusters=3, random_state=42)
df2['KMeans_Cluster_ICP_All'] = kmeans_ICP.fit_predict(pca_transformed_ICP)
# Calculate silhouette score for KMeans
silhouette_kmeans_icp_all = silhouette_score(pca_transformed_ICP, df2['KMeans_Cluster_ICP_All'])
# GMM Clustering
gmm_ICP = GaussianMixture(n_components=3, random_state=42)
df2['GMM_Cluster_ICP_All'] = gmm_ICP.fit_predict(pca_transformed_ICP)
# Calculate silhouette score for GMM
silhouette_gmm_icp_all = silhouette_score(pca_transformed_ICP, df2['GMM_Cluster_ICP_All'])
# Ensemble Clustering (Majority Voting for Full Components)
kmeans_labels_all = df2['KMeans_Cluster_ICP_All'].to_numpy()
gmm_labels_all = df2['GMM_Cluster_ICP_All'].to_numpy()
# Combine labels from KMeans and GMM
refined_labels_all = np.array([kmeans_labels_all, gmm_labels_all])
ensemble_labels_all = mode(refined_labels_all, axis=0)[0].flatten()
df2['Ensemble_Cluster_ICP_All'] = ensemble_labels_all
# Silhouette score for Ensemble
silhouette_ensemble_icp_all = silhouette_score(pca_transformed_ICP, ensemble_labels_all)
# ===== First Two Components Clustering =====
# KMeans clustering
kmeans_2 = KMeans(n_clusters=3, random_state=42)
df2['KMeans_Cluster_ICP_Reduced'] = kmeans_2.fit_predict(pca_transformed_ICP[:, :2])
# Silhouette score for KMeans
silhouette_kmeans_icp_reduced = silhouette_score(pca_transformed_ICP[:, :2], df2['KMeans_Cluster_ICP_Reduced'])
# GMM clustering
gmm_2 = GaussianMixture(n_components=3, random_state=42)
df2['GMM_Cluster_ICP_Reduced'] = gmm_2.fit_predict(pca_transformed_ICP[:, :2])
# Silhouette score for GMM
silhouette_gmm_icp_reduced = silhouette_score(pca_transformed_ICP[:, :2], df2['GMM_Cluster_ICP_Reduced'])
# Ensemble Clustering (Majority Voting for First Two Components)
kmeans_labels_reduced = df2['KMeans_Cluster_ICP_Reduced'].to_numpy()
gmm_labels_reduced = df2['GMM_Cluster_ICP_Reduced'].to_numpy()
# Combine labels from KMeans and GMM
refined_labels_reduced = np.array([kmeans_labels_reduced, gmm_labels_reduced])
ensemble_labels_reduced = mode(refined_labels_reduced, axis=0)[0].flatten()
df2['Ensemble_Cluster_ICP_Reduced'] = ensemble_labels_reduced
# Silhouette score for Ensemble
silhouette_ensemble_icp_reduced = silhouette_score(pca_transformed_ICP[:, :2], ensemble_labels_reduced)
# ===== Compare Silhouette Scores =====
print("Full Components Clustering Silhouette Scores:")
print(f"KMeans Silhouette Score (All Components): {silhouette_kmeans_icp_all:.4f}")
print(f"GMM Silhouette Score (All Components): {silhouette_gmm_icp_all:.4f}")
print(f"Ensemble Silhouette Score (All Components): {silhouette_ensemble_icp_all:.4f}")
print("\nFirst Two Components Clustering Silhouette Scores:")
print(f"KMeans Silhouette Score (Reduced Components): {silhouette_kmeans_icp_reduced:.4f}")
print(f"GMM Silhouette Score (Reduced Components): {silhouette_gmm_icp_reduced:.4f}")
print(f"Ensemble Silhouette Score (Reduced Components): {silhouette_ensemble_icp_reduced:.4f}")
Full Components Clustering Silhouette Scores:
KMeans Silhouette Score (All Components): 0.5157
GMM Silhouette Score (All Components): 0.5157
Ensemble Silhouette Score (All Components): 0.5157

First Two Components Clustering Silhouette Scores:
KMeans Silhouette Score (Reduced Components): 0.6829
GMM Silhouette Score (Reduced Components): 0.6829
Ensemble Silhouette Score (Reduced Components): 0.6829
Findings¶
The first two components outperform the full set:
- KMeans (Reduced Components): 0.6829 vs. KMeans (All Components): 0.5157
- GMM (Reduced Components): 0.6829 vs. GMM (All Components): 0.5157
The first two PCA components capture the majority of the variance and provide better-defined clusters, while the additional components appear to introduce noise or less relevant features that degrade clustering performance. The ensemble follows the trend: in both scenarios its silhouette score matches that of its constituent models (KMeans and GMM), so majority voting adds no separate benefit here. The drop in both models' scores when all components are used indicates that the added dimensions carry noise or weaker patterns. Hereafter, 2 components will be used in the model.
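The all-vs-reduced comparison can be made systematic by scoring the clustering at every PCA truncation depth, rather than only at 2 and 5 components. A sketch on synthetic data; the notebook would pass `pca_transformed_ICP` instead.

```python
# Silhouette of KMeans at every PCA truncation depth.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=42)
Z = PCA().fit_transform(X)
for d in range(1, Z.shape[1] + 1):
    labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(Z[:, :d])
    print(f"first {d} component(s): silhouette = {silhouette_score(Z[:, :d], labels):.3f}")
```

This would show whether the score degrades gradually as components are added or drops abruptly past a particular depth.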
# Create a comparison table for ICP-specific clusters
comparison_table = pd.crosstab(
df2['KMeans_Cluster_ICP'],
df2['GMM_Cluster_ICP'],
rownames=['KMeans_ICP'],
colnames=['GMM_ICP']
)
# Make the row index (KMeans_ICP) 1-indexed
comparison_table.index = comparison_table.index + 1
comparison_table.index.name = 'KMeans_ICP (1-Indexed)'
# Make the column index (GMM_ICP) 1-indexed
comparison_table.columns = comparison_table.columns + 1
comparison_table.columns.name = 'GMM_ICP (1-Indexed)'
# Display the updated cross-tab
print("\nComparison Table - 1-Indexed:")
print(comparison_table)
Comparison Table - 1-Indexed:
GMM_ICP (1-Indexed)       1    2   3
KMeans_ICP (1-Indexed)
1                       386    0   0
2                         0  224   0
3                         0    0  50
import matplotlib.pyplot as plt
# Convert the comparison table to a format suitable for plotting
comparison_table_reset = comparison_table.reset_index()
comparison_table_melted = comparison_table_reset.melt(
id_vars='KMeans_ICP (1-Indexed)',
var_name='GMM_ICP (1-Indexed)',
value_name='Count'
)
# Remove rows where Count is zero
comparison_table_melted = comparison_table_melted[comparison_table_melted['Count'] > 0]
# Create the bubble plot
plt.figure(figsize=(9, 5))
bubble_plot = plt.scatter(
comparison_table_melted['GMM_ICP (1-Indexed)'],
comparison_table_melted['KMeans_ICP (1-Indexed)'],
s=comparison_table_melted['Count'] * 10, # Scale bubble size
alpha=0.6,
c='blue',
edgecolors='black'
)
# Add labels and title
plt.title('Comparison of KMeans and GMM Clusters', fontsize=14)
plt.xlabel('GMM Cluster (1-Indexed)', fontsize=12)
plt.ylabel('KMeans Cluster (1-Indexed)', fontsize=12)
plt.xticks(comparison_table.columns)
plt.yticks(comparison_table.index)
plt.grid(True, linestyle='--', alpha=0.6)
# Add annotations for counts
for _, row in comparison_table_melted.iterrows():
    plt.text(
        row['GMM_ICP (1-Indexed)'],
        row['KMeans_ICP (1-Indexed)'],
        str(row['Count']),
        color='black',
        ha='center',
        va='center',
        fontsize=10
    )
# Show the plot
plt.tight_layout()
plt.show()
Observation¶
Grouping the customers within clusters helps with evaluating cluster consistency, as well as identifying stable clusters. There is a strong indication that larger groups of customers are more representative, and when these are common across the models, a robust consensus results. Quantitative scoring will add to these insights.
# Visualize with t-SNE
# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42) # Adjust perplexity if needed
tsne_results = tsne.fit_transform(pca_transformed_ICP[:, :2]) # Use PCA-reduced data (first 2 components)
# Create a t-SNE dataframe for ICP clusters
tsne_df2_icp = pd.DataFrame(tsne_results, columns=['TSNE1', 'TSNE2'], index=df2.index)
# Add ICP-specific cluster labels and relevant features
tsne_df2_icp['KMeans_Cluster_ICP'] = df2['KMeans_Cluster_ICP']
tsne_df2_icp['GMM_Cluster_ICP'] = df2['GMM_Cluster_ICP']
tsne_df2_icp['KMedoids_Cluster_ICP'] = df2['KMedoids_Cluster_ICP']
# Select hover fields to keep it simple
fields_to_include = ['Avg_Credit_Limit', 'Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online']
tsne_df2_icp = tsne_df2_icp.join(df2[fields_to_include])
# KMeans
fig_kmeans_icp = px.scatter(
tsne_df2_icp, x='TSNE1', y='TSNE2', color='KMeans_Cluster_ICP',
hover_data=fields_to_include, # Fields for hover information
title='t-SNE Visualization with KMeans Clusters (ICP)',
color_continuous_scale='Viridis'
)
fig_kmeans_icp.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)
fig_kmeans_icp.show()
# GMM
fig_gmm_icp = px.scatter(
tsne_df2_icp, x='TSNE1', y='TSNE2', color='GMM_Cluster_ICP',
hover_data=fields_to_include,
title='t-SNE Visualization with GMM Clusters (ICP)',
color_continuous_scale='Viridis'
)
fig_gmm_icp.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)
fig_gmm_icp.show()
# K-Medoids
fig_kmedoids_icp = px.scatter(
tsne_df2_icp, x='TSNE1', y='TSNE2', color='KMedoids_Cluster_ICP',
hover_data=fields_to_include,
title='t-SNE Visualization with K-Medoids Clusters (ICP)',
color_continuous_scale='Viridis'
)
fig_kmedoids_icp.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)
fig_kmedoids_icp.show()
Interpretation¶
Both KMeans and GMM have high silhouette scores, confirmed by their distinct and consistent clusters with clear separations in the t-SNE graphs above. They are effective in capturing the underlying structure of the data, making them very useful for defining the Ideal Customer Profile. The t-SNE also confirms that KMedoids will not be helpful for creating this profile. The combination of KMeans and GMM, on the other hand, provides a robust consensus clustering, with cluster alignment that leverages their respective strengths: KMeans for its simplicity, and GMM for its flexibility in capturing nuanced patterns, especially in overlapping groups.
# Ideal Customer Profile
from IPython.display import Image, display
# Display the image
display(Image(filename='/mnt/e/mikecbos_E/Downloads/MIT_Elective-AllLife/ICP_Personas.png'))
# Get existing cluster combinations in df2
valid_combinations = df2.groupby(['KMeans_Cluster_ICP', 'GMM_Cluster_ICP']).size().reset_index()
valid_combinations.columns = ['KMeans_Cluster_ICP', 'GMM_Cluster_ICP', 'Count']
# Summarize each observed cluster combination
ICP_summaries = {}
for kmeans_cluster, gmm_cluster in valid_combinations[['KMeans_Cluster_ICP', 'GMM_Cluster_ICP']].values:
    # Filter rows in df2 matching this cluster combination
    cluster_data = df2[
        (df2['KMeans_Cluster_ICP'] == kmeans_cluster) &
        (df2['GMM_Cluster_ICP'] == gmm_cluster)
    ]
    # Summarize pertinent features (report clusters 1-indexed for readability)
    ICP_summaries[f"KMeans {kmeans_cluster + 1}, GMM {gmm_cluster + 1}"] = cluster_data[
        ['Avg_Credit_Limit', 'Total_Credit_Cards',
         'Total_visits_bank', 'Total_visits_online', 'Total_calls_made']
    ].mean()
# Convert ICP summaries to DataFrame
ICP_summaries_df = pd.DataFrame(ICP_summaries).T
# Update the index for readability
ICP_summaries_df.index.name = "Cluster Combination"
# Display the ICP summaries
print("\nIdeal Customer Profiles (ICP Summaries):")
print(ICP_summaries_df)
Ideal Customer Profiles (ICP Summaries):
                     Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  Total_visits_online  Total_calls_made
Cluster Combination
KMeans 1, GMM 1          33782.383420            5.515544           3.489637             0.981865          2.000000
KMeans 2, GMM 2          12174.107143            2.410714           0.933036             3.553571          6.870536
KMeans 3, GMM 3         141040.000000            8.740000           0.600000            10.900000          1.080000
Personas¶
Low-income Cluster: KMeans 2, GMM 2
- Average Credit Limit: $12,174 – Relatively low compared to other clusters.
- Total Credit Cards: 2.41 – Indicates a conservative number of credit cards.
- Bank Visits: 0.93 – Very few visits to the bank, suggesting reliance on other channels.
- Online Visits: 3.55 – Moderate online activity.
- Calls Made: 6.87 – High reliance on phone interactions.
- Profile: This persona likely represents low-to-moderate income customers who prefer phone communication and engage moderately with online banking.
Traditional Communication Preference Cluster: KMeans 1, GMM 1
- Average Credit Limit: $33,782 – Indicates mid-tier customers with good credit access.
- Total Credit Cards: 5.52 – A significantly higher number of credit cards.
- Bank Visits: 3.49 – High frequency of bank visits.
- Online Visits: 0.98 – Very low online activity.
- Calls Made: 2.00 – Minimal phone engagement.
- Profile: This persona likely represents traditional customers who rely on in-person banking and have moderate financial resources. Their low online activity indicates limited digital adoption.
Online Communication Preference Cluster: KMeans 3, GMM 3
- Average Credit Limit: $141,040 – Very high credit limit.
- Total Credit Cards: 8.74 – A large number of credit cards.
- Bank Visits: 0.60 – Rarely visits the bank.
- Online Visits: 10.90 – Heavy online activity.
- Calls Made: 1.08 – Minimal phone interactions.
- Profile: This persona represents affluent, tech-savvy customers who prefer online banking and have substantial financial resources. Their low reliance on in-person or phone communication suggests a preference for self-service digital platforms.
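To make the personas travel with the data, the numeric cluster labels can be mapped to persona names. A sketch with a toy label column; the mapping below is hypothetical, assuming the 0-indexed KMeans labels correspond to the personas in the order of the summary table, and should be verified against the actual cluster means before use.

```python
# Hypothetical mapping from 0-indexed KMeans labels to persona names;
# toy Series stands in for df2['KMeans_Cluster_ICP'].
import pandas as pd

persona_map = {
    0: "Traditional (branch-first)",
    1: "Phone-first, low limit",
    2: "Affluent, online-first",
}
labels = pd.Series([0, 1, 2, 0])
print(labels.map(persona_map).tolist())
```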
Service Dissatisfaction Analysis¶
Exploratory Data Analysis of 5 Components vs 2 Components¶
Building on the earlier Upselling study, which utilized all components, and the ICP study, which focused on 2 (reduced) components, these studies together provide a framework for exploring Service Dissatisfaction. Specifically, they help analyze the respective contribution scores of the components and evaluate how these arrangements might apply.
# Heatmap to compare All vs Reduced (2) Components
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Define features to apply PCA
features = ['Avg_Credit_Limit', 'Total_Credit_Cards',
'Total_visits_bank', 'Total_visits_online', 'Total_calls_made']
# Standardize features before PCA
scaler = StandardScaler()
df2_scaled = scaler.fit_transform(df2[features])
# Apply PCA
pca = PCA()
pca_transformed_ICP = pca.fit_transform(df2_scaled)
# PCA Loadings Matrix (All Components)
loadings_ICP = pd.DataFrame(
pca.components_,
columns=['Avg_Credit_Limit', 'Total_Credit_Cards',
'Total_visits_bank', 'Total_visits_online', 'Total_calls_made'],
index=[f'PCA_{i+1}' for i in range(pca.components_.shape[0])]
)
# Reduced Components Loadings (First 2 Components)
reduced_loadings = pd.DataFrame(
pca.components_[:2], # Use the first 2 components
columns=['Avg_Credit_Limit', 'Total_Credit_Cards',
'Total_visits_bank', 'Total_visits_online', 'Total_calls_made'],
index=['PCA_1', 'PCA_2'] # Label reduced components
)
# Plot side-by-side heatmaps
fig, axes = plt.subplots(1, 2, figsize=(18, 8), gridspec_kw={'width_ratios': [5, 3]})
# All Components Heatmap
sns.heatmap(loadings_ICP, annot=True, cmap='coolwarm', ax=axes[0])
axes[0].set_title('All Components Heatmap', fontsize=14)
axes[0].set_xticklabels(axes[0].get_xticklabels(), fontsize=10, rotation=45)
axes[0].set_yticklabels(axes[0].get_yticklabels(), fontsize=10)
axes[0].set_xlabel('Features', fontsize=12)
axes[0].set_ylabel('Principal Components (All)', fontsize=12)
# Reduced Components Heatmap (First 2 Components)
sns.heatmap(reduced_loadings, annot=True, cmap='coolwarm', ax=axes[1], cbar=False)
axes[1].set_title('Reduced Components Heatmap', fontsize=14)
axes[1].set_xticklabels(axes[1].get_xticklabels(), fontsize=10, rotation=45)
axes[1].set_yticklabels(axes[1].get_yticklabels(), fontsize=10)
axes[1].set_xlabel('Features', fontsize=12)
axes[1].set_ylabel('Principal Components (Reduced)', fontsize=12)
# Adjust layout
plt.tight_layout()
plt.show()
Observation: Divergence vs Convergence¶
The prior ICP study was convergent in its focus on identifying the Ideal Customer. Like the Upselling study, this study is divergent, exploring the multiple ways in which diverse customers experience dissatisfaction. While the Reduced (2) Component arrangement works well for the Ideal Customer Profile and using all components is crucial for Upselling, Service Dissatisfaction requires considering both high positive (red) and high negative (blue) loadings, rather than focusing solely on high or average values. Highlighting the extreme contribution scores from each principal component offers valuable insight in this context.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Define thresholds for highlighting extreme values
upper_threshold = 0.5 # +high%
lower_threshold = -0.5 # -high%
# Mask all non-extreme values
masked_data = loadings_ICP.copy()  # `loadings_ICP` is the PCA loadings DataFrame computed above
masked_data[(masked_data < upper_threshold) & (masked_data > lower_threshold)] = np.nan # Mask non-extreme values
# Plot the heatmap with extreme values highlighted
plt.figure(figsize=(10, 6))
sns.heatmap(masked_data, annot=True, cmap='coolwarm', fmt='.2f', cbar=False,
vmax=1, vmin=-1, linewidths=0.5, linecolor='black')
# Set the title
plt.title('Extreme Loadings (|value| > 0.5)', fontsize=14)
# Move X-axis labels to the top
plt.xticks(fontsize=10, rotation=45)
plt.yticks(fontsize=10)
plt.gca().xaxis.tick_top()  # Move X-axis ticks to the top
plt.gca().xaxis.set_label_position('top')  # Set X-axis labels as top-aligned
plt.xlabel('Features', fontsize=12)
plt.ylabel('Principal Components', fontsize=12)  # Rows are components, columns are features
# Adjust layout
plt.tight_layout()
plt.show()
# Define features for SVC
features = ['Avg_Credit_Limit', 'Total_Credit_Cards',
'Total_visits_bank', 'Total_visits_online', 'Total_calls_made']
# Standardize features for SVC
scaler = StandardScaler()
df2_scaled_SVC = scaler.fit_transform(df2[features])
# Apply PCA for SVC
pca_SVC = PCA()
pca_transformed_SVC = pca_SVC.fit_transform(df2_scaled_SVC)
# Define the PCA Loadings Matrix
loadings_SVC = pd.DataFrame(
pca_SVC.components_,
columns=features,
index=[f'PCA_{i+1}' for i in range(pca_SVC.components_.shape[0])]
)
# Define thresholds for extreme values
upper_threshold = 0.5
lower_threshold = -0.5
# Create reduced subsets
reduced_loadings = {
"PC1 + PC2": loadings_SVC[:2],
"PC1 + PC2 + PC3": loadings_SVC[:3],
"PC1 + PC2 + PC3 + PC4": loadings_SVC[:4]
}
# Apply extreme value mask to each subset
masked_loadings = {
label: subset.where((subset > upper_threshold) | (subset < lower_threshold), np.nan)
for label, subset in reduced_loadings.items()
}
# Plot thumbnails in a grid layout
fig, axes = plt.subplots(1, 3, figsize=(20, 6), sharey=False)
for ax, (label, data) in zip(axes, masked_loadings.items()):
sns.heatmap(data, annot=True, cmap='coolwarm', fmt='.2f', cbar=False, ax=ax)
ax.set_title(label, fontsize=14)
ax.set_xlabel('', fontsize=12)
ax.set_ylabel('', fontsize=12)
ax.tick_params(axis='x', labelrotation=45, labelsize=10)
ax.tick_params(axis='y', labelsize=10)
plt.tight_layout()
plt.show()
Observation¶
The above heatmap highlights both high and low contribution scores for each principal component. Reviewing the contribution scores of principal components 2 through 4, specifically in relation to each feature, offers an additional perspective.
Decision Point¶
When reviewing the extreme contribution scores of each principal component to the features, PCA_3 appears redundant due to overlapping feature representation:
Total_visits_bank: The contribution of PCA_3 is already well-represented by PCA_2.
Total_calls_made: The contribution of PCA_3 is already well-represented by PCA_1.
Explained Variance Contribution as a Metric¶
The Explained Variance Contribution metric is ideal for this Service Dissatisfaction analysis because it quantifies how much of the total variance in the data is captured by each principal component. This includes both positive and negative contributions, ensuring a comprehensive view of the data's structure. This metric will help evaluate the exclusion of PCA_3.
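As a quick sanity check on how this metric behaves (a minimal synthetic sketch on random stand-in data, not the AllLife dataset): the per-component shares are sorted largest-first and together account for all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 200 x 5 feature matrix (illustrative, not the bank data)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))

# explained_variance_ratio_ gives each component's share of total variance
ratios = PCA().fit(X).explained_variance_ratio_
print(np.round(ratios, 3))             # five shares, largest first
print(round(float(ratios.sum()), 6))   # shares sum to 1.0
```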
# Explained Variance Contribution Comparison
import matplotlib.pyplot as plt
# All Components - Explained Variance Contribution
explained_variance_ratio = pca_SVC.explained_variance_ratio_
cumulative_variance = explained_variance_ratio.cumsum()
# Selected Components - Explained Variance Contribution
selected_indices = [0, 1, 3, 4] # Indices corresponding to PC1, PC2, PC4, PC5
selected_explained_variance = explained_variance_ratio[selected_indices]
cumulative_selected_variance = selected_explained_variance.cumsum()
# Create a side-by-side comparison plot
fig, axes = plt.subplots(1, 2, figsize=(16, 6), sharey=True)
# Plot All Components
axes[0].bar(
range(1, len(explained_variance_ratio) + 1),
explained_variance_ratio, alpha=0.7, label='Individual Explained Variance'
)
axes[0].step(
range(1, len(cumulative_variance) + 1),
cumulative_variance, where='mid', color='red', label='Cumulative Explained Variance'
)
axes[0].set_title("Explained Variance (All Components)", fontsize=14)
axes[0].set_xlabel("Principal Component", fontsize=12)
axes[0].set_ylabel("Variance Explained", fontsize=12)
axes[0].legend(loc='best')
# Plot Selected Components
axes[1].bar(
[1, 2, 4, 5],
selected_explained_variance, alpha=0.7, label='Individual Explained Variance'
)
axes[1].step(
[1, 2, 4, 5],
cumulative_selected_variance, where='mid', color='red', label='Cumulative Explained Variance'
)
axes[1].set_title("Explained Variance (Reduced Components)", fontsize=14)
axes[1].set_xlabel("Principal Component", fontsize=12)
#axes[1].legend(loc='best')
# Adjust layout
plt.tight_layout()
plt.show()
Findings¶
- PCA_1 and PCA_2 dominate the explained variance, capturing the majority of variability in the data.
- Efficient Variance Retention: PCA_1, PCA_2, PCA_4, and PCA_5 collectively capture most of the variance, supporting the Decision Point above to exclude PCA_3 from this Service Dissatisfaction study.
- The reduced component set (PCA_1, PCA_2, PCA_4, PCA_5) balances simplicity and variance retention, making it ideal for clustering and interpretation.
# Clustering Analysis: Elbow Plot
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Index 0 corresponds to PC1, so the desired components are 0, 1, 3, 4
SVC_selected_components = pca_transformed_SVC[:, [0, 1, 3, 4]]
# Define range of cluster numbers to evaluate
cluster_range = range(1, 10) # Try 1 to 9 clusters
inertia_values = []
# Compute inertia for each number of clusters
for k in cluster_range:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(SVC_selected_components)
inertia_values.append(kmeans.inertia_)
# Plot the elbow curve
plt.figure(figsize=(8, 6))
plt.plot(cluster_range, inertia_values, marker='o', linestyle='-')
plt.xlabel('Number of Clusters', fontsize=12)
plt.ylabel('Inertia (Sum of Squared Distances)', fontsize=12)
plt.title('Elbow Plot for Optimal Clusters', fontsize=14)
plt.xticks(cluster_range, fontsize=10)
plt.yticks(fontsize=10)
plt.grid(True)
plt.show()
Finding¶
Three clusters are optimal for this Service Dissatisfaction analysis.
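The elbow read-off can be made less subjective with a simple heuristic: choose the k where the inertia curve bends most sharply, i.e. where the second difference of inertia is largest. A hedged sketch on an illustrative inertia curve (the values below are hypothetical, not the study's):

```python
import numpy as np

# Hypothetical inertia values for k = 1..6 (illustrative only)
ks = list(range(1, 7))
inertias = [1000.0, 600.0, 150.0, 130.0, 120.0, 115.0]

# Second difference measures how sharply the curve bends at each interior k
bends = np.diff(inertias, 2)                 # defined for k = 2..5
elbow_k = ks[int(np.argmax(bends)) + 1]      # +1 re-centres onto the bend point
print(elbow_k)  # → 3
```

This is only a heuristic; visual inspection, silhouette scores, or gap statistics are common cross-checks.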
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score
import numpy as np
from scipy.stats import mode
# Evaluation of silhouette scores for 3 Clusters
# Selected PCA components for SVC
SVC_selected_components = pca_transformed_SVC[:, [0, 1, 3, 4]]
# KMeans clustering with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(SVC_selected_components)
# Add KMeans cluster labels to the DataFrame
df2['SVC_KMeans_Cluster'] = kmeans_labels
# Calculate silhouette score for KMeans
silhouette_kmeans = silhouette_score(SVC_selected_components, kmeans_labels)
# GMM clustering with 3 clusters
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(SVC_selected_components)
# Add GMM cluster labels to the DataFrame
df2['SVC_GMM_Cluster'] = gmm_labels
# Calculate silhouette score for GMM
silhouette_gmm = silhouette_score(SVC_selected_components, gmm_labels)
# K-Medoids clustering with 3 clusters
kmedoids = KMedoids(n_clusters=3, random_state=42)
kmedoids_labels = kmedoids.fit_predict(SVC_selected_components)
# Add K-Medoids cluster labels to the DataFrame
df2['SVC_KMedoids_Cluster'] = kmedoids_labels
# Calculate silhouette score for K-Medoids
silhouette_kmedoids = silhouette_score(SVC_selected_components, kmedoids_labels)
# Ensemble Voting for Final Clusters
# Combine the cluster assignments from all methods
cluster_results = np.array([kmeans_labels, gmm_labels, kmedoids_labels]).T
# Determine the ensemble cluster assignment using majority voting
ensemble_labels = mode(cluster_results, axis=1)[0].flatten()
# Add ensemble cluster labels to the DataFrame
df2['SVC_Ensemble_Cluster'] = ensemble_labels
# Calculate silhouette score for Ensemble
silhouette_ensemble = silhouette_score(SVC_selected_components, ensemble_labels)
# Print silhouette scores
print(f"KMeans Silhouette Score: {silhouette_kmeans:.4f}")
print(f"GMM Silhouette Score: {silhouette_gmm:.4f}")
print(f"K-Medoids Silhouette Score: {silhouette_kmedoids:.4f}")
print(f"Ensemble Silhouette Score: {silhouette_ensemble:.4f}")
KMeans Silhouette Score: 0.5672
GMM Silhouette Score: 0.5672
K-Medoids Silhouette Score: 0.3787
Ensemble Silhouette Score: 0.5672
Decision Point: Ensemble Clustering¶
While the silhouette scores for KMeans and GMM are equal and the ensemble does not show a quantitative improvement in clustering quality, leveraging an ensemble approach allows us to combine the strengths of all three models:
KMeans' simplicity in identifying well-separated, defined clusters.
GMM's probabilistic modeling, which captures overlapping or elliptical clusters and accounts for uncertainty in assignments.
K-Medoids' robustness to outliers, providing a valuable perspective on extreme cases that might represent significant dissatisfaction or unique customer behaviors.
This combined framework is particularly valuable when analyzing extreme values (high and low probabilities) and outliers. By integrating the robustness of K-Medoids with the strengths of KMeans and GMM, the ensemble clustering approach uncovers nuanced patterns that may not be evident from any model independently. This is essential for identifying and addressing service dissatisfaction and understanding edge cases, enabling better-targeted strategies and improved service quality.
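One caveat on the voting step above: cluster IDs produced by different algorithms are arbitrary, so majority voting is only meaningful once each model's labels are aligned to a common reference (the identical silhouette scores suggest the models here largely agreed, which makes the vote benign). A hedged sketch of one alignment approach, maximum-overlap matching via `scipy.optimize.linear_sum_assignment`; the helper name and toy arrays are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(reference, labels, n_clusters):
    """Relabel `labels` so each cluster ID maps to the best-overlapping
    reference ID (Hungarian / maximum-overlap assignment)."""
    overlap = np.zeros((n_clusters, n_clusters), dtype=int)
    for r, l in zip(reference, labels):
        overlap[r, l] += 1                           # contingency counts
    row_ind, col_ind = linear_sum_assignment(-overlap)  # maximize overlap
    mapping = {c: r for r, c in zip(row_ind, col_ind)}
    return np.array([mapping[l] for l in labels])

# Toy check: a relabelled copy of the same partition realigns exactly
ref = np.array([0, 0, 1, 1, 2, 2])
perm = np.array([2, 2, 0, 0, 1, 1])
print(align_labels(ref, perm, 3))  # → [0 0 1 1 2 2]
```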
# Cluster Summary Data Table
# Cluster summary for ensemble clusters
SVC_cluster_summary = df2.groupby('SVC_Ensemble_Cluster')[
['Avg_Credit_Limit', 'Total_Credit_Cards',
'Total_visits_bank', 'Total_visits_online', 'Total_calls_made']
].mean()
print("Cluster Summary (Average values):")
# 1-Indexed for Cluster ID
SVC_cluster_summary.index = SVC_cluster_summary.index + 1
print(SVC_cluster_summary)
Cluster Summary (Average values):
                      Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  Total_visits_online  Total_calls_made
SVC_Ensemble_Cluster
1                         33782.383420            5.515544           3.489637             0.981865          2.000000
2                        141040.000000            8.740000           0.600000            10.900000          1.080000
3                         12174.107143            2.410714           0.933036             3.553571          6.870536
Context for the 3 clusters¶
Cluster 1: Features moderate credit limits (avg. $33,782), mid-level credit card counts (avg. 5.5), high bank visits (avg. 3.49), and minimal online visits (avg. 0.98).
Cluster 2: Characterized by high credit limits (avg. $141,040), many credit cards (avg. 8.7), low bank visits (avg. 0.6), frequent online activity (avg. 10.9), and very few calls (avg. 1.08).
Cluster 3: Represents low credit limits (avg. $12,174), fewer credit cards (avg. 2.4), moderate online visits (avg. 3.55), and high call activity (avg. 6.87).
from scipy.spatial.distance import jensenshannon
# Evaluate extreme high and extreme low values between clusters distributions
# Calculate feature distributions for each cluster
feature_distributions = df2.groupby('SVC_Ensemble_Cluster')[features].mean()
# Jensen-Shannon divergence between clusters
js_divergences = np.zeros((len(feature_distributions), len(feature_distributions)))
for i in range(len(feature_distributions)):
for j in range(len(feature_distributions)):
js_divergences[i, j] = jensenshannon(feature_distributions.iloc[i], feature_distributions.iloc[j])
# Convert divergence matrix to DataFrame for visualization
js_divergences_df = pd.DataFrame(
js_divergences,
index=[f"Cluster {i+1}" for i in feature_distributions.index],
columns=[f"Cluster {i+1}" for i in feature_distributions.index]
)
print("Jensen-Shannon Divergence Matrix:")
print(js_divergences_df)
print("\n")
# Heatmap visualization
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
sns.heatmap(js_divergences_df, annot=True, cmap="coolwarm", fmt=".3f")
plt.title("Jensen-Shannon Divergence Between Clusters")
plt.show()
Jensen-Shannon Divergence Matrix:
Cluster 1 Cluster 2 Cluster 3
Cluster 1 0.000000 0.007553 0.013505
Cluster 2 0.007553 0.000000 0.015791
Cluster 3 0.013505 0.015791 0.000000
Observations¶
The Jensen-Shannon Divergence (JSD) measures divergence between cluster distributions across the full set of features. Per the matrix above, the highest divergences are between Clusters 2 and 3 (0.016) and between Clusters 1 and 3 (0.014), indicating distinct behavioral or characteristic patterns; the lowest is between Clusters 1 and 2 (0.008), suggesting overlap or shared characteristics.
High divergence highlights clusters at the extremes of service dissatisfaction or user characteristics, essential for targeting specific behaviors or needs. Low divergence reveals clusters with potentially overlapping behaviors, aiding in refining cluster boundaries or exploring transitional patterns.
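A brief note on `scipy.spatial.distance.jensenshannon` (toy vectors below, not the cluster data): it normalizes each input to a probability distribution before comparing, so applying it to raw feature means, as above, implicitly treats those means as distributions across features. It returns 0 for identical inputs and, with `base=2`, is bounded above by 1.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.5, 0.5, 0.0])   # toy probability vectors
q = np.array([0.0, 0.5, 0.5])

print(jensenshannon(p, p))           # identical distributions → 0.0
d = jensenshannon(p, q, base=2)      # partially overlapping distributions
print(0.0 < d <= 1.0)                # → True
```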
from sklearn.manifold import TSNE
import plotly.express as px
import pandas as pd
from sklearn.cluster import KMeans
# Step 1: Apply t-SNE to reduce to 2D space
tsne = TSNE(n_components=2, random_state=42, perplexity=30, n_iter=1000)
tsne_results = tsne.fit_transform(SVC_selected_components)
# Step 2: Create a DataFrame for visualization
tsne_df = pd.DataFrame(tsne_results, columns=['t-SNE_1', 't-SNE_2'])
tsne_df['SVC_KMeans_Cluster'] = df2['SVC_KMeans_Cluster']
tsne_df['SVC_GMM_Cluster'] = df2['SVC_GMM_Cluster']
tsne_df['SVC_KMedoids_Cluster'] = df2['SVC_KMedoids_Cluster']
# Step 3: Visualize t-SNE for each clustering model using Plotly
# KMeans Clustering
fig_kmeans = px.scatter(
tsne_df,
x='t-SNE_1',
y='t-SNE_2',
color='SVC_KMeans_Cluster',
title='t-SNE Visualization for KMeans Clustering',
labels={'color': 'SVC_KMeans Cluster'}
)
fig_kmeans.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False) # Set dimensions
fig_kmeans.show()
# GMM Clustering
fig_gmm = px.scatter(
tsne_df,
x='t-SNE_1',
y='t-SNE_2',
color='SVC_GMM_Cluster',
title='t-SNE Visualization for GMM Clustering',
labels={'color': 'SVC_GMM Cluster'}
)
fig_gmm.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False) # Set dimensions
fig_gmm.show()
# KMedoids Clustering
fig_kmedoids = px.scatter(
tsne_df,
x='t-SNE_1',
y='t-SNE_2',
color='SVC_KMedoids_Cluster',
title='t-SNE Visualization for KMedoids Clustering',
labels={'color': 'SVC_KMedoids Cluster'}
)
fig_kmedoids.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False) # Set dimensions
fig_kmedoids.show()
# Select specific principal components (PC1, PC2, PC4, PC5)
# Index 0 corresponds to PC1, so the desired components are 0, 1, 3, 4
SVC_selected_components = pca_transformed_SVC[:, [0, 1, 3, 4]]
# Use the selected components for clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(SVC_selected_components)
# Add cluster labels to the original DataFrame
df2['SVC_Cluster_Selected'] = kmeans_labels
# Filter loadings for PC1, PC2, PC4, PC5
SVC_selected_loadings = loadings_SVC.loc[['PCA_1', 'PCA_2', 'PCA_4', 'PCA_5']]
print("Contribution Weighting: Service Dissatisfaction Loadings (unbounded)")
print("\n")
print(SVC_selected_loadings)
Contribution Weighting: Service Dissatisfaction Loadings (unbounded)
        Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  Total_visits_online  Total_calls_made
PCA_1           0.488859            0.597679           0.280492             0.111783         -0.559129
PCA_2           0.403240            0.030171          -0.586587             0.665161          0.223527
PCA_4          -0.308617            0.741352          -0.445278            -0.318388          0.235605
PCA_5           0.709337           -0.105122          -0.050586            -0.592200          0.364047
Observations¶
The t-SNE visualizations confirm valid cluster separation, and the corresponding Contribution Weighting table provides insight into the drivers of cluster formation:
- PCA_1 is heavily influenced by credit-related factors and inversely related to call activity. This suggests any dissatisfaction might be related to account or product management, and less so to customer service calls.
- PCA_2 is driven by online interactions over in-person banking. This suggests any dissatisfaction might be related to frustrations with online engagement, and less influenced by in-person visits.
- PCA_4 has a strong positive contribution from Total_Credit_Cards (0.74) and moderate negative contributions from Total_visits_bank (-0.45) and Total_visits_online (-0.32). This suggests any dissatisfaction may be tied to customers who hold many cards yet engage little through branch or online channels.
- PCA_5 has a strong positive contribution from Avg_Credit_Limit (0.71), a strong negative contribution from Total_visits_online (-0.59), and a moderate positive contribution from Total_calls_made (0.36). This suggests dissatisfaction may be associated with high-limit customers who avoid online channels and lean on phone support.
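Per-component readings like these can also be extracted programmatically by reusing the ±0.5 extreme-loading threshold from earlier. A sketch on an illustrative loadings matrix (the values and shape below are examples, not the study's output):

```python
import pandas as pd

# Illustrative loadings matrix (example values, not the study's output)
loadings = pd.DataFrame(
    {'Avg_Credit_Limit':    [0.49, 0.40],
     'Total_calls_made':    [-0.56, 0.22],
     'Total_visits_online': [0.11, 0.67]},
    index=['PCA_1', 'PCA_2'])

# Keep only features whose |loading| exceeds the 0.5 threshold, per component
extremes = {pc: row[row.abs() > 0.5].index.tolist()
            for pc, row in loadings.iterrows()}
print(extremes)  # → {'PCA_1': ['Total_calls_made'], 'PCA_2': ['Total_visits_online']}
```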
# t-SNE Ensemble
tsne_df['SVC_Ensemble_Cluster'] = df2['SVC_Ensemble_Cluster']
fig = px.scatter(
tsne_df,
x='t-SNE_1',
y='t-SNE_2',
color='SVC_Ensemble_Cluster',
title='t-SNE Visualization for Ensemble Clustering',
labels={'color': 'SVC Ensemble Cluster'}
)
fig.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)
fig.show()
Observation¶
The above ensemble clustering demonstrates high fidelity as a collective representation of individual models by consolidating their strengths: KMeans' simplicity, GMM's probabilistic overlaps, and KMedoids' outlier handling. This integration forms a robust foundation for uncovering unique insights into service dissatisfaction, examining edge cases, and analyzing cluster overlaps. By leveraging the ensemble's comprehensive perspective, we can identify nuanced patterns and address specific business challenges effectively. The ensemble together represents a balanced, integrated view, visualized by the clusters-to-components 3D plot below.
# 3D Visualize Components within Clusters
# Create a DataFrame with reduced components for interactive visualization
# Adjust the column names to match the number of components
pca_df = pd.DataFrame(
SVC_selected_components,
columns=['PC1', 'PC2', 'PC4', 'PC5'] # Use only the selected components
)
pca_df['Cluster'] = df2['SVC_Ensemble_Cluster'] + 1 # Add cluster labels
# Define which components to plot
component_x = 'PC5' # Replace with desired component
component_y = 'PC2' # Replace with desired component
component_z = 'PC4' # Replace with desired component
color_component = 'PC1' # Component for coloring the points
hover_data = {
'Cluster': True,
'PC1': True,
'PC2': True,
'PC4': True,
'PC5': True
}
fig = px.scatter_3d(
pca_df,
x=component_x,
y=component_y,
z=component_z,
color=color_component, # Color points by the selected component
title=f"3D Visualization ({component_x} vs {component_y} vs {component_z}, Color: {color_component})",
labels={'color': f"{color_component}"}, # Label the color legend
opacity=0.7,
hover_data=hover_data
)
# Update layout for better visuals
fig.update_layout(
width=800,
height=500,
scene=dict(
camera=dict(
up=dict(x=0, y=0, z=1), # Standard upward orientation
center=dict(x=0, y=0, z=0), # Center the plot at the origin
eye=dict(x=1.5, y=0.5, z=1.25) # Camera position
),
xaxis_title=component_x,
yaxis_title=component_y,
zaxis_title=component_z,
xaxis=dict(showticklabels=False),
yaxis=dict(showticklabels=False),
zaxis=dict(showticklabels=False),
aspectratio=dict(
x=1,
y=1,
z=0.8)
),
margin=dict(
l=100,
r=1,
t=25,
b=1
),
coloraxis_colorbar=dict(title=color_component)
)
# Show the plot
fig.show()
Observations¶
The 3D plot reveals clear and distinct clusters, confirming the separation identified through PCA and clustering. The color gradient representing PC1 (which accounts for the majority of variance, as shown in the Explained Variance Contribution Comparison) highlights areas of intensity, such as dissatisfaction extremes driven by credit limits and channel reliance. This plot consolidates the analysis, illustrating the interplay among all four components and their contributions to cluster formation.
- Cluster 1 can be characterized as Balanced Usage: Moderate engagement across channels but potential dissatisfaction due to moderate financial access.
- Cluster 2 can be characterized as Digital-First, High-Credit: High online usage with minimal in-person or call dependency, but dissatisfaction may arise from unmet digital service expectations.
- Cluster 3 can be characterized as High Call Dependency: Dissatisfaction arises from financial constraints and heavy reliance on call-based support.
Conclusion & Recommendations¶
Conclusion
This comprehensive study offers a multi-faceted view of the customers and banking interactions at AllLife Bank. The dataset reveals insights into both volitional and non-volitional factors that shape this complex and dynamic environment. Divergent and convergent analyses highlight both incidental and intentional approaches to customer engagement. Successful implementation of the study's findings will require a focus-group approach to validate and refine these insights, combined with an iterative feedback loop to keep the findings as dynamic as the evolving landscape of AllLife Bank.
Recommendations
Use the cluster segmentations from this study (K-Means and K-Medoids) in a focus-group approach to validate hypotheses and refine upselling strategies. Steps for hypothesis testing would include: 1) using the identified clusters to define the hypotheses, 2) selecting diverse customer samples from each cluster, and 3) testing and validating by exploring preferences qualitatively in focus groups and with A/B testing to measure response to targeted offers. Feedback from these findings should be used iteratively to refine the segments and adjust profiles and strategies.
This approach will empower management teams to 1) Validate assumptions while reducing risk, 2) Engage stakeholders for actionable insights, and 3) Create a feedback loop for iterative improvement to ensure upselling strategies are data-driven and effectively tailored to customer behavior.
Use the personas from this study to develop Ideal Customer Profiles (ICPs) to tailor strategies for engagement, retention, and growth. For Low-to-Moderate Prospects/Customers, offer simple, accessible services via phone and online while providing financial literacy programs to empower resource management. This is an incidental approach to stay engaged in their lives if circumstances change. For Traditional Prospects/Customers, focus on personalized in-branch experiences while encouraging digital adoption with incentives to reduce in-branch operational costs. This approach is more intentional to foster loyalty. For Affluent, Tech-Savvy Prospects/Customers, be intentional with this very promising segment by enhancing digital platforms to meet high-tech expectations. Premium services (e.g., exclusive rewards or concierge banking) are other examples of intentional engagement with them.
The next step for creating the Ideal Customer Profile is to validate findings with domain experts to ensure strategic alignment and monitor and update profiles based on evolving customer behavior. This will ensure targeted outreach programs.
Customers' perception of support services must be improved by tailoring customer support strategies to the distinct needs of the clusters identified in this study. Drivers of dissatisfaction were segmented by the ensemble clustering analysis. Here are specific actions associated with each:
- High Call Dependency / Low Credit
- Reduce over-reliance on their calls and offer self-service, AI-enabled proactive guidance
- Evaluate the strategic value of fostering incidental engagement with these customers against the opportunity cost of prioritizing intentional engagement with higher-value customers
- Package low-cost, low-service, low-maintenance, self-service products to reduce the need for servicing, and thus improve the perception of poor service quality
- Balanced Usage / Moderate Credit
- Maintain a balance across in-person, online, and call services
- Intentionally target this segment for service improvement metrics, create word-of-mouth marketing, and leverage their testimonials to grow their business while combating negative perceptions of poor service quality
- Digital-First / High Credit
- Enhance digital platforms to meet high-credit customers' expectations
- Offer exclusive digital perks and tools to elevate prestige and self-perception
- Use ongoing feedback to adjust support strategies and validate improvements in service perception
Cluster-specific needs as identified by this study can be leveraged to address perception of poor service quality and enhance customer experience and loyalty.